- var slogan = "Build Tomorrow's Language Technologies"
- var tag_line = "spaCy – " + slogan
- var a_minor_miracle = 'a minor miracle'
mixin lede()
p.
spaCy is a library for industrial-strength NLP in Python and
Cython. It features state-of-the-art speed and accuracy, a concise API, and
great documentation. If you're a small company doing NLP, we want spaCy to
seem like !{a_minor_miracle}.
mixin overview()
p.
Overview text
mixin example()
p.
Example text
mixin benchmarks()
p.
Benchmarks
mixin get_started()
p.
Get Started
mixin example(name)
details
summary
span(class="example-name")= name
block
mixin accuracy_head
tr
mixin columns(...names)
tr
each name in names
th= name
mixin row(...cells)
tr
each cell in cells
td= cell
doctype html
html(lang="en")
head
meta(charset="utf-8")
title!= tag_line
meta(name="description" content="")
meta(name="author" content="Matthew Honnibal")
link(rel="stylesheet" href="css/style.css")
body(id="page" role="document")
header(role="banner")
h1(class="logo")!= tag_line
div(class="slogan")!= slogan
nav(role="navigation")
ul
li: a(href="#") Home
li: a(href="#") Docs
li: a(href="#") License
li: a(href="#") Blog
main(id="content" role="main")
section(class="intro")
+lede
nav(role="navigation")
ul
li: a(href="#example-use" class="button") Examples
li: a(href="#benchmarks" class="button") Comparisons
li: a(href="#example-use" class="button") Demo
li: a(href="#get-started" class="button") Install
article(class="page landing-page")
a(name="example-use"): h3 Usage by Example
+example("Load resources and process text")
pre.language-python
code
| from __future__ import unicode_literals, print_function
| from spacy.en import English
| nlp = English()
| doc = nlp('Hello, world. Here are two sentences.')
+example("Get tokens and sentences")
pre.language-python
code
| token = doc[0]
| sentence = next(doc.sents)
| assert token is sentence[0]
+example("Use integer IDs for any string")
pre.language-python
code
| hello_id = nlp.vocab.strings['Hello']
| hello_str = nlp.vocab.strings[hello_id]
|
| assert token.orth == hello_id == 52
| assert token.orth_ == hello_str == 'Hello'
+example("Get and set string views and flags")
pre.language-python
code
| assert token.shape_ == 'Xxxx'
| for lexeme in nlp.vocab:
| if lexeme.is_alpha:
| lexeme.shape_ = 'W'
| elif lexeme.is_digit:
| lexeme.shape_ = 'D'
| elif lexeme.is_punct:
| lexeme.shape_ = 'P'
| else:
| lexeme.shape_ = 'M'
| assert token.shape_ == 'W'
+example("Export to numpy arrays")
pre.language-python
code
| Do me
+example("Word vectors")
pre.language-python
code
| Do me
+example("Part-of-speech tags")
pre.language-python
code
| Do me
+example("Syntactic dependencies")
pre.language-python
code
| Do me
+example("Named entities")
pre.language-python
code
| Do me
+example("Define custom NER rules")
pre.language-python
code
| Do me
+example("Calculate inline mark-up on original string")
pre.language-python
code
| Do me
+example("Efficient binary serialization")
pre.language-python
code
| Do me
a(name="benchmarks"): h3 Benchmarks
details
summary: h4 Independent Evaluation
p
| Independent evaluation by Yahoo! Labs and Emory
| University, to appear at ACL 2015. Higher is better.
table
thead
+columns("System", "Language", "Accuracy", "Speed")
tbody
+row("spaCy v0.86", "Cython", "91.9", "13,963")
+row("spaCy v0.84", "Cython", "90.6", "13,963")
+row("ClearNLP", "Java", "91.7", "10,271")
+row("CoreNLP", "Java", "89.6", "8,602")
+row("MATE", "Java", "92.5", "550")
+row("Turbo", "C++", "92.4", "349")
+row("Yara", "Java", "92.3", "340")
p
| Accuracy is the percentage of unlabelled arcs correct; speed is tokens per second.
p
| Joel Tetreault and Amanda Stent (Yahoo! Labs) and Jin-ho Choi (Emory)
| performed a detailed comparison of the best parsers available.
| All numbers above are taken from the pre-print they kindly made
| available to me, except for spaCy v0.86.
p
| I'm particularly grateful to the authors for discussion of their
| results, which led to the improvement in accuracy between v0.84 and
| v0.86. A tip from Jin-ho (developer of ClearNLP) was particularly
| useful.
details
summary: h4 Detailed Accuracy Comparison
details
summary: h4 Detailed Speed Comparison
table
thead
tr
th.
th(colspan=3) Absolute (ms per doc)
th(colspan=3) Relative (to spaCy)
tbody
tr
td: strong System
td: strong Split
td: strong Tag
td: strong Parse
td: strong Split
td: strong Tag
td: strong Parse
+row("spaCy", "0.2ms", "1ms", "19ms", "1x", "1x", "1x")
+row("CoreNLP", "2ms", "10ms", "49ms", "10x", "10x", "2.6x")
+row("ZPar", "1ms", "8ms", "850ms", "5x", "8x", "44.7x")
+row("NLTK", "4ms", "443ms", "n/a", "20x", "443x", "n/a")
p
| Setup: 100,000 plain-text documents were streamed
| from an SQLite3 database and processed with each NLP library to one
| of three levels of detail: tokenization, tagging, or parsing.
| The tasks are additive: to parse the text you have to tokenize and
| tag it. The pre-processing time was not subtracted, so each figure is
| the time required for the whole pipeline to complete, reported as the
| mean time per document in milliseconds.
p
| Hardware: Intel i7-3770 (2012)
//+comparison("spaCy vs. NLTK")
//+comparison("spaCy vs. Pattern")
//+comparison("spaCy vs. CoreNLP")
//+comparison("spaCy vs. ClearNLP")
//+comparison("spaCy vs. OpenNLP")
//+comparison("spaCy vs. GATE")
a(name="get-started"): h3 Get started
+get_started
footer(role="contentinfo")
script(src="js/prism.js")