💫 Interactive code examples, spaCy Universe and various docs improvements (#2274)

* Integrate Python kernel via Binder

* Add live model test for languages with examples

* Update docs and code examples

* Adjust margin (if not bootstrapped)

* Add binder version to global config

* Update terminal and executable code mixins

* Pass attributes through infobox and section

* Hide v-cloak

* Fix example

* Take out model comparison for now

* Add meta text for compat

* Remove chart.js dependency

* Tidy up and simplify JS and port big components over to Vue

* Remove chartjs example

* Add Twitter icon

* Add purple stylesheet option

* Add utility for hand cursor (special cases only)

* Add transition classes

* Add small option for section

* Add thumb object for small round thumbnail images

* Allow unset code block language via "none" value

(workaround so that an unset language still defaults to DEFAULT_SYNTAX)

* Pass through attributes

* Add syntax highlighting definitions for Julia, R and Docker

* Add website icon

* Remove user survey from navigation

* Don't hide GitHub icon on small screens

* Make top navigation scrollable on small screens

* Remove old resources page and references to it

* Add Universe

* Add helper functions for better page URL and title

* Update site description

* Increment versions

* Update preview images

* Update mentions of resources

* Fix image

* Fix social images

* Fix problem with cover sizing and floats

* Add divider and move badges into heading

* Add docstrings

* Reference converting section

* Add section on converting word vectors

* Move converting section to custom section and fix formatting

* Remove old fastText example

* Move extensions content to own section

Keep weird ID so as not to break permalinks for now (we don't want to rewrite URLs unless absolutely necessary)

* Use better component example and add factories section

* Add note on larger model

* Use better example for non-vector

* Remove similarity in context section

Only works with small models that include tensors, so it has always been somewhat confusing

* Add note on init-model command

* Fix lightning tour examples and make executable if possible

* Add spacy train CLI section to train

* Fix formatting and add video

* Fix formatting

* Fix textcat example description (resolves #2246)

* Add dummy file to try to resolve conflict

* Delete dummy file

* Tidy up [ci skip]

* Ensure sufficient height of loading container

* Add loading animation to universe

* Update Thebelab build and use better startup message

* Fix asset versioning

* Fix typo [ci skip]

* Add note on project idea label
Ines Montani 2018-04-29 02:06:46 +02:00 committed by GitHub
parent 3c80f69ff5
commit 49cee4af92
75 changed files with 2930 additions and 1894 deletions

View File

@ -49,7 +49,7 @@ integration. It's commercial open-source software, released under the MIT license.
`New in v2.0`_ New features, backwards incompatibilities and migration guide.
`API Reference`_ The detailed reference for spaCy's API.
`Models`_ Download statistical language models for spaCy.
`Resources`_ Libraries, extensions, demos, books and courses.
`Universe`_ Libraries, extensions, demos, books and courses.
`Changelog`_ Changes and version history.
`Contribute`_ How to contribute to the spaCy project and code base.
=================== ===
@ -59,7 +59,7 @@ integration. It's commercial open-source software, released under the MIT license.
.. _Usage Guides: https://spacy.io/usage/
.. _API Reference: https://spacy.io/api/
.. _Models: https://spacy.io/models
.. _Resources: https://spacy.io/usage/resources
.. _Universe: https://spacy.io/universe
.. _Changelog: https://spacy.io/usage/#changelog
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

View File

@ -1,6 +1,6 @@
#!/usr/bin/env python
# coding: utf8
"""Train a multi-label convolutional neural network text classifier on the
"""Train a convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
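
For context, a minimal sketch of how the trained categorizer's predictions could be inspected after this example script has saved its output (the model path below is hypothetical, not part of the example):

    import spacy

    # hypothetical path: wherever train_textcat.py saved the trained pipeline
    nlp = spacy.load('/tmp/imdb_textcat_model')
    doc = nlp(u"This movie was surprisingly good.")

    # the TextCategorizer exposes its scores as a dict of label -> probability
    print(doc.cats)  # e.g. {'POSITIVE': 0.93}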

View File

@ -1,7 +1,7 @@
{
"globals": {
"title": "spaCy",
"description": "spaCy is a free open-source library featuring state-of-the-art speed and accuracy and a powerful Python API.",
"description": "spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.",
"SITENAME": "spaCy",
"SLOGAN": "Industrial-strength Natural Language Processing in Python",
@ -10,10 +10,13 @@
"COMPANY": "Explosion AI",
"COMPANY_URL": "https://explosion.ai",
"DEMOS_URL": "https://demos.explosion.ai",
"DEMOS_URL": "https://explosion.ai/demos",
"MODELS_REPO": "explosion/spacy-models",
"KERNEL_BINDER": "ines/spacy-binder",
"KERNEL_PYTHON": "python3",
"SPACY_VERSION": "2.0",
"BINDER_VERSION": "2.0.11",
"SOCIAL": {
"twitter": "spacy_io",
@ -26,7 +29,8 @@
"NAVIGATION": {
"Usage": "/usage",
"Models": "/models",
"API": "/api"
"API": "/api",
"Universe": "/universe"
},
"FOOTER": {
@ -34,7 +38,7 @@
"Usage": "/usage",
"Models": "/models",
"API Reference": "/api",
"Resources": "/usage/resources"
"Universe": "/universe"
},
"Support": {
"Issue Tracker": "https://github.com/explosion/spaCy/issues",
@ -82,8 +86,8 @@
}
],
"V_CSS": "2.0.1",
"V_JS": "2.0.1",
"V_CSS": "2.1.1",
"V_JS": "2.1.0",
"DEFAULT_SYNTAX": "python",
"ANALYTICS": "UA-58931649-1",
"MAILCHIMP": {

View File

@ -15,12 +15,39 @@
- MODEL_META = public.models._data.MODEL_META
- MODEL_LICENSES = public.models._data.MODEL_LICENSES
- MODEL_BENCHMARKS = public.models._data.MODEL_BENCHMARKS
- EXAMPLE_SENT_LANGS = public.models._data.EXAMPLE_SENT_LANGS
- EXAMPLE_SENTENCES = public.models._data.EXAMPLE_SENTENCES
- IS_PAGE = (SECTION != "index") && !landing
- IS_MODELS = (SECTION == "models" && LANGUAGES[current.source])
- HAS_MODELS = IS_MODELS && CURRENT_MODELS.length
//- Get page URL
- function getPageUrl() {
- var path = current.path;
- if(path[path.length - 1] == 'index') path = path.slice(0, path.length - 1);
- return `${SITE_URL}/${path.join('/')}`;
- }
//- Get pretty page title depending on section
- function getPageTitle() {
- var sections = ['api', 'usage', 'models'];
- if (sections.includes(SECTION)) {
- var titleSection = (SECTION == "api") ? 'API' : SECTION.charAt(0).toUpperCase() + SECTION.slice(1);
- return `${title} · ${SITENAME} ${titleSection} Documentation`;
- }
- else if (SECTION != 'index') return `${title} · ${SITENAME}`;
- return `${SITENAME} · ${SLOGAN}`;
- }
//- Get social image based on section and settings
- function getPageImage() {
- var img = (SECTION == 'api') ? 'api' : 'default';
- return `${SITE_URL}/assets/img/social/preview_${preview || img}.jpg`;
- }
//- Add prefixes to items of an array (for modifier CSS classes)
array - [array] list of class names or options, e.g. ["foot"]

View File

@ -7,7 +7,7 @@ include _functions
id - [string] anchor assigned to section (used for breadcrumb navigation)
mixin section(id)
section.o-section(id="section-" + id data-section=id)
section.o-section(id=id ? "section-" + id : null data-section=id)&attributes(attributes)
block
@ -143,7 +143,7 @@ mixin aside-wrapper(label, emoji)
mixin aside(label, emoji)
+aside-wrapper(label, emoji)
.c-aside__text.u-text-small
.c-aside__text.u-text-small&attributes(attributes)
block
@ -154,7 +154,7 @@ mixin aside(label, emoji)
prompt - [string] prompt displayed before first line, e.g. "$"
mixin aside-code(label, language, prompt)
+aside-wrapper(label)
+aside-wrapper(label)&attributes(attributes)
+code(false, language, prompt).o-no-block
block
@ -165,7 +165,7 @@ mixin aside-code(label, language, prompt)
argument to be able to wrap it for spacing
mixin infobox(label, emoji)
aside.o-box.o-block.u-text-small
aside.o-box.o-block.u-text-small&attributes(attributes)
if label
h3.u-heading.u-text-label.u-color-theme
if emoji
@ -242,7 +242,9 @@ mixin button(url, trusted, ...style)
wrap - [boolean] wrap text and disable horizontal scrolling
mixin code(label, language, prompt, height, icon, wrap)
pre.c-code-block.o-block(class="lang-#{(language || DEFAULT_SYNTAX)}" class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
- var lang = (language != "none") ? (language || DEFAULT_SYNTAX) : null
- var lang_class = (language != "none") ? "lang-" + (language || DEFAULT_SYNTAX) : null
pre.c-code-block.o-block(data-language=lang class=lang_class class=icon ? "c-code-block--has-icon" : null style=height ? "height: #{height}px" : null)&attributes(attributes)
if label
h4.u-text-label.u-text-label--dark=label
if icon
@ -253,6 +255,15 @@ mixin code(label, language, prompt, height, icon, wrap)
code.c-code-block__content(class=wrap ? "u-wrap" : null data-prompt=prompt)
block
//- Executable code
mixin code-exec(label, large)
- label = (label || "Editable code example") + " (experimental)"
+terminal-wrapper(label, !large)
figure.thebelab-wrapper
span.thebelab-wrapper__text.u-text-tiny v#{BINDER_VERSION} · Python 3 · via #[+a("https://mybinder.org/").u-hide-link Binder]
+code(data-executable="true")&attributes(attributes)
block
//- Wrapper for code blocks to display old/new versions
@ -658,12 +669,16 @@ mixin qs(data, style)
//- Terminal-style code window
label - [string] title displayed in top bar of terminal window
mixin terminal(label, button_text, button_url)
.x-terminal
.x-terminal__icons: span
.u-padding-small.u-text-label.u-text-center=label
mixin terminal-wrapper(label, small)
.x-terminal(class=small ? "x-terminal--small" : null)
.x-terminal__icons(class=small ? "x-terminal__icons--small" : null): span
.u-padding-small.u-text-center(class=small ? "u-text-tiny" : "u-text")
strong=label
block
+code.x-terminal__code
mixin terminal(label, button_text, button_url, exec)
+terminal-wrapper(label)
+code.x-terminal__code(data-executable=exec ? "" : null)
block
if button_text && button_url

View File

@ -10,10 +10,7 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
li.c-nav__menu__item(class=is_active ? "is-active" : null)
+a(url)(tabindex=is_active ? "-1" : null)=item
li.c-nav__menu__item.u-hidden-xs
+a("https://survey.spacy.io", true) User Survey 2018
li.c-nav__menu__item.u-hidden-xs
li.c-nav__menu__item
+a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]
progress.c-progress.js-progress(value="0" max="1")

View File

@ -1,7 +1,9 @@
//- 💫 INCLUDES > MODELS PAGE TEMPLATE
for id in CURRENT_MODELS
- var comps = getModelComponents(id)
+section(id)
section(data-vue=id data-model=id)
+grid("vcenter").o-no-block(id=id)
+grid-col("two-thirds")
+h(2)
@ -9,24 +11,21 @@ for id in CURRENT_MODELS
+grid-col("third").u-text-right
.u-color-subtle.u-text-tiny
+button(gh("spacy-models") + "/releases", true, "secondary", "small")(data-tpl=id data-tpl-key="download")
+button(gh("spacy-models") + "/releases", true, "secondary", "small")(v-bind:href="releaseUrl")
| Release details
.u-padding-small Latest: #[code(data-tpl=id data-tpl-key="version") n/a]
.u-padding-small Latest: #[code(v-text="version") n/a]
+aside-code("Installation", "bash", "$").
python -m spacy download #{id}
- var comps = getModelComponents(id)
p(v-if="description" v-text="description")
p(data-tpl=id data-tpl-key="description")
div(data-tpl=id data-tpl-key="error")
+infobox
+infobox(v-if="error")
| Unable to load model details from GitHub. To find out more
| about this model, see the overview of the
| #[+a(gh("spacy-models") + "/releases") latest model releases].
+table.o-block-small(data-tpl=id data-tpl-key="table")
+table.o-block-small(v-bind:data-loading="loading")
+row
+cell #[+label Language]
+cell #[+tag=comps.lang] #{LANGUAGES[comps.lang]}
@ -36,42 +35,76 @@ for id in CURRENT_MODELS
+cell #[+tag=comp] #{MODEL_META[comp]}
+row
+cell #[+label Size]
+cell #[+tag=comps.size] #[span(data-tpl=id data-tpl-key="size") #[em n/a]]
+cell #[+tag=comps.size] #[span(v-text="sizeFull" v-if="sizeFull")] #[em(v-else="") n/a]
each label in ["Pipeline", "Vectors", "Sources", "Author", "License"]
- var field = label.toLowerCase()
if field == "vectors"
- field = "vecs"
+row(v-if="pipeline && pipeline.length" v-cloak="")
+cell
+label Pipeline #[+help(MODEL_META.pipeline).u-color-subtle]
+cell
span(v-for="(pipe, index) in pipeline" v-if="pipeline")
code(v-text="pipe")
span(v-if="index != pipeline.length - 1") , 
+row(v-if="vectors" v-cloak="")
+cell
+label Vectors #[+help(MODEL_META.vectors).u-color-subtle]
+cell(v-text="vectors")
+row(v-if="sources && sources.length" v-cloak="")
+cell
+label Sources #[+help(MODEL_META.sources).u-color-subtle]
+cell
span(v-for="(source, index) in sources") {{ source }}
span(v-if="index != sources.length - 1") , 
+row(v-if="author" v-cloak="")
+cell #[+label Author]
+cell
+a("")(v-bind:href="url" v-if="url" v-text="author")
span(v-else="" v-text="author") {{ model.author }}
+row(v-if="license" v-cloak="")
+cell #[+label License]
+cell
+a("")(v-bind:href="modelLicenses[license]" v-if="modelLicenses[license]") {{ license }}
span(v-else="") {{ license }}
+row(v-cloak="")
+cell #[+label Compat #[+help(MODEL_META.compat).u-color-subtle]]
+cell
.o-field.u-float-left
select.o-field__select.u-text-small(v-model="spacyVersion")
option(v-for="version in orderedCompat" v-bind:value="version") spaCy v{{ version }}
code(v-if="compatVersion" v-text="compatVersion")
em(v-else="") not compatible
+grid.o-block-small(v-cloak="" v-if="hasAccuracy")
for keys, label in MODEL_BENCHMARKS
.u-flex-full.u-padding-small
+table.o-block-small
+row("head")
+head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label)
for label, field in keys
+row
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
+cell
span(data-tpl=id data-tpl-key=field) #[em n/a]
+cell("num")
span(v-if="#{field}" v-text="#{field}")
em(v-if="!#{field}") n/a
+row(data-tpl=id data-tpl-key="compat-wrapper" hidden="")
+cell
+label Compat #[+help("Latest compatible model version for your spaCy installation").u-color-subtle]
+cell
.o-field.u-float-left
select.o-field__select.u-text-small(data-tpl=id data-tpl-key="compat")
div(data-tpl=id data-tpl-key="compat-versions")  
p.u-text-small.u-color-dark(v-if="notes" v-text="notes" v-cloak="")
section(data-tpl=id data-tpl-key="benchmarks" hidden="")
+grid.o-block-small
for keys, label in MODEL_BENCHMARKS
.u-flex-full.u-padding-small(data-tpl=id data-tpl-key=label.toLowerCase() hidden="")
+table.o-block-small
+row("head")
+head-cell(colspan="2")=(MODEL_META["benchmark_" + label] || label)
for label, field in keys
+row(hidden="")
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
+cell("num")(data-tpl=id data-tpl-key=field)
| n/a
if comps.size == "sm" && EXAMPLE_SENT_LANGS.includes(comps.lang)
section
+code-exec("Test the model live").
import spacy
from spacy.lang.#{comps.lang}.examples import sentences
nlp = spacy.load('#{id}')
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
print(token.text, token.pos_, token.dep_)
p.u-text-small.u-color-dark(data-tpl=id data-tpl-key="notes")

View File

@ -1,86 +1,33 @@
//- 💫 INCLUDES > SCRIPTS
if quickstart
script(src="/assets/js/vendor/quickstart.min.js")
if IS_PAGE || SECTION == "index"
script(type="text/x-thebe-config")
| { bootstrap: true, binderOptions: { repo: "#{KERNEL_BINDER}"},
| kernelOptions: { name: "#{KERNEL_PYTHON}" }}
if IS_PAGE
script(src="/assets/js/vendor/in-view.min.js")
- scripts = ["vendor/prism.min", "vendor/vue.min"]
- if (SECTION == "universe") scripts.push("vendor/vue-markdown.min")
- if (quickstart) scripts.push("vendor/quickstart.min")
- if (IS_PAGE) scripts.push("vendor/in-view.min")
- if (IS_PAGE || SECTION == "index") scripts.push("vendor/thebelab.custom.min")
for script in scripts
script(src="/assets/js/" + script + ".js")
script(src="/assets/js/main.js?v#{V_JS}" type=(environment == "deploy") ? null : "module")
if environment == "deploy"
script(async src="https://www.google-analytics.com/analytics.js")
script(src="/assets/js/vendor/prism.min.js")
if compare_models
script(src="/assets/js/vendor/chart.min.js")
script(src="https://www.google-analytics.com/analytics.js", async)
script
if quickstart
| new Quickstart("#qs");
if environment == "deploy"
| window.ga=window.ga||function(){
| (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date;
| ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview');
if IS_PAGE
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
script
| ((window.gitter = {}).chat = {}).options = {
| useStyles: false,
| activationElement: '.js-gitter-button',
| targetElement: '.js-gitter',
| room: '!{SOCIAL.gitter}'
| };
if IS_PAGE
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
//- JS modules slightly hacky, but necessary to dynamically instantiate the
classes with data from the Harp JSON files, while still being able to
support older browsers that can't handle JS modules. More details:
https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
- ProgressBar = "new ProgressBar('.js-progress');"
- Accordion = "new Accordion('.js-accordion');"
- Changelog = "new Changelog('" + SOCIAL.github + "', 'spacy');"
- NavHighlighter = "new NavHighlighter('data-section', 'data-nav');"
- GitHubEmbed = "new GitHubEmbed('" + SOCIAL.github + "', 'data-gh-embed');"
- ModelLoader = "new ModelLoader('" + MODELS_REPO + "'," + JSON.stringify(CURRENT_MODELS) + "," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + ");"
- ModelComparer = "new ModelComparer('" + MODELS_REPO + "'," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + "," + JSON.stringify(LANGUAGES) + "," + JSON.stringify(MODEL_META) + "," + JSON.stringify(default_models || false) + ");"
if environment == "deploy"
//- DEPLOY: use compiled rollup.js and instantiate classes directly
script(src="/assets/js/rollup.js?v#{V_JS}")
script
!=ProgressBar
if changelog
!=Changelog
if IS_PAGE
!=NavHighlighter
!=GitHubEmbed
!=Accordion
if HAS_MODELS
!=ModelLoader
if compare_models
!=ModelComparer
else
//- DEVELOPMENT: Use ES6 modules
script(type="module")
| import ProgressBar from '/assets/js/progress.js';
!=ProgressBar
if changelog
| import Changelog from '/assets/js/changelog.js';
!=Changelog
if IS_PAGE
| import NavHighlighter from '/assets/js/nav-highlighter.js';
!=NavHighlighter
| import GitHubEmbed from '/assets/js/github-embed.js';
!=GitHubEmbed
| import Accordion from '/assets/js/accordion.js';
!=Accordion
if HAS_MODELS
| import { ModelLoader } from '/assets/js/models.js';
!=ModelLoader
if compare_models
| import { ModelComparer } from '/assets/js/models.js';
!=ModelComparer

View File

@ -7,6 +7,12 @@ svg(style="position: absolute; visibility: hidden; width: 0; height: 0;" width="
symbol#svg_github(viewBox="0 0 27 32")
path(d="M13.714 2.286q3.732 0 6.884 1.839t4.991 4.991 1.839 6.884q0 4.482-2.616 8.063t-6.759 4.955q-0.482 0.089-0.714-0.125t-0.232-0.536q0-0.054 0.009-1.366t0.009-2.402q0-1.732-0.929-2.536 1.018-0.107 1.83-0.321t1.679-0.696 1.446-1.188 0.946-1.875 0.366-2.688q0-2.125-1.411-3.679 0.661-1.625-0.143-3.643-0.5-0.161-1.446 0.196t-1.643 0.786l-0.679 0.429q-1.661-0.464-3.429-0.464t-3.429 0.464q-0.286-0.196-0.759-0.482t-1.491-0.688-1.518-0.241q-0.804 2.018-0.143 3.643-1.411 1.554-1.411 3.679 0 1.518 0.366 2.679t0.938 1.875 1.438 1.196 1.679 0.696 1.83 0.321q-0.696 0.643-0.875 1.839-0.375 0.179-0.804 0.268t-1.018 0.089-1.17-0.384-0.991-1.116q-0.339-0.571-0.866-0.929t-0.884-0.429l-0.357-0.054q-0.375 0-0.518 0.080t-0.089 0.205 0.161 0.25 0.232 0.214l0.125 0.089q0.393 0.179 0.777 0.679t0.563 0.911l0.179 0.411q0.232 0.679 0.786 1.098t1.196 0.536 1.241 0.125 0.991-0.063l0.411-0.071q0 0.679 0.009 1.58t0.009 0.973q0 0.321-0.232 0.536t-0.714 0.125q-4.143-1.375-6.759-4.955t-2.616-8.063q0-3.732 1.839-6.884t4.991-4.991 6.884-1.839zM5.196 21.982q0.054-0.125-0.125-0.214-0.179-0.054-0.232 0.036-0.054 0.125 0.125 0.214 0.161 0.107 0.232-0.036zM5.75 22.589q0.125-0.089-0.036-0.286-0.179-0.161-0.286-0.054-0.125 0.089 0.036 0.286 0.179 0.179 0.286 0.054zM6.286 23.393q0.161-0.125 0-0.339-0.143-0.232-0.304-0.107-0.161 0.089 0 0.321t0.304 0.125zM7.036 24.143q0.143-0.143-0.071-0.339-0.214-0.214-0.357-0.054-0.161 0.143 0.071 0.339 0.214 0.214 0.357 0.054zM8.054 24.589q0.054-0.196-0.232-0.286-0.268-0.071-0.339 0.125t0.232 0.268q0.268 0.107 0.339-0.107zM9.179 24.679q0-0.232-0.304-0.196-0.286 0-0.286 0.196 0 0.232 0.304 0.196 0.286 0 0.286-0.196zM10.214 24.5q-0.036-0.196-0.321-0.161-0.286 0.054-0.25 0.268t0.321 0.143 0.25-0.25z")
symbol#svg_twitter(viewBox="0 0 30 32")
path(d="M28.929 7.286q-1.196 1.75-2.893 2.982 0.018 0.25 0.018 0.75 0 2.321-0.679 4.634t-2.063 4.437-3.295 3.759-4.607 2.607-5.768 0.973q-4.839 0-8.857-2.589 0.625 0.071 1.393 0.071 4.018 0 7.161-2.464-1.875-0.036-3.357-1.152t-2.036-2.848q0.589 0.089 1.089 0.089 0.768 0 1.518-0.196-2-0.411-3.313-1.991t-1.313-3.67v-0.071q1.214 0.679 2.607 0.732-1.179-0.786-1.875-2.054t-0.696-2.75q0-1.571 0.786-2.911 2.161 2.661 5.259 4.259t6.634 1.777q-0.143-0.679-0.143-1.321 0-2.393 1.688-4.080t4.080-1.688q2.5 0 4.214 1.821 1.946-0.375 3.661-1.393-0.661 2.054-2.536 3.179 1.661-0.179 3.321-0.893z")
symbol#svg_website(viewBox="0 0 32 32")
path(d="M22.658 10.988h5.172c0.693 1.541 1.107 3.229 1.178 5.012h-5.934c-0.025-1.884-0.181-3.544-0.416-5.012zM20.398 3.896c2.967 1.153 5.402 3.335 6.928 6.090h-4.836c-0.549-2.805-1.383-4.799-2.092-6.090zM16.068 9.986v-6.996c1.066 0.047 2.102 0.216 3.092 0.493 0.75 1.263 1.719 3.372 2.33 6.503h-5.422zM9.489 22.014c-0.234-1.469-0.396-3.119-0.421-5.012h5.998v5.012h-5.577zM9.479 10.988h5.587v5.012h-6.004c0.025-1.886 0.183-3.543 0.417-5.012zM11.988 3.461c0.987-0.266 2.015-0.435 3.078-0.469v6.994h-5.422c0.615-3.148 1.591-5.265 2.344-6.525zM3.661 9.986c1.551-2.8 4.062-4.993 7.096-6.131-0.715 1.29-1.559 3.295-2.114 6.131h-4.982zM8.060 16h-6.060c0.066-1.781 0.467-3.474 1.158-5.012h5.316c-0.233 1.469-0.39 3.128-0.414 5.012zM8.487 22.014h-5.29c-0.694-1.543-1.139-3.224-1.204-5.012h6.071c0.024 1.893 0.188 3.541 0.423 5.012zM8.651 23.016c0.559 2.864 1.416 4.867 2.134 6.142-3.045-1.133-5.557-3.335-7.11-6.142h4.976zM15.066 23.016v6.994c-1.052-0.033-2.067-0.199-3.045-0.46-0.755-1.236-1.736-3.363-2.356-6.534h5.401zM21.471 23.016c-0.617 3.152-1.592 5.271-2.344 6.512-0.979 0.271-2.006 0.418-3.059 0.465v-6.977h5.403zM16.068 17.002h5.998c-0.023 1.893-0.188 3.542-0.422 5.012h-5.576v-5.012zM22.072 16h-6.004v-5.012h5.586c0.235 1.469 0.393 3.126 0.418 5.012zM23.070 17.002h5.926c-0.066 1.787-0.506 3.468-1.197 5.012h-5.152c0.234-1.471 0.398-3.119 0.423-5.012zM27.318 23.016c-1.521 2.766-3.967 4.949-6.947 6.1 0.715-1.276 1.561-3.266 2.113-6.1h4.834z")
symbol#svg_code(viewBox="0 0 20 20")
path(d="M5.719 14.75c-0.236 0-0.474-0.083-0.664-0.252l-5.060-4.498 5.341-4.748c0.412-0.365 1.044-0.33 1.411 0.083s0.33 1.045-0.083 1.412l-3.659 3.253 3.378 3.002c0.413 0.367 0.45 0.999 0.083 1.412-0.197 0.223-0.472 0.336-0.747 0.336zM14.664 14.748l5.341-4.748-5.060-4.498c-0.413-0.367-1.045-0.33-1.411 0.083s-0.33 1.045 0.083 1.412l3.378 3.003-3.659 3.252c-0.413 0.367-0.45 0.999-0.083 1.412 0.197 0.223 0.472 0.336 0.747 0.336 0.236 0 0.474-0.083 0.664-0.252zM9.986 16.165l2-12c0.091-0.545-0.277-1.060-0.822-1.151-0.547-0.092-1.061 0.277-1.15 0.822l-2 12c-0.091 0.545 0.277 1.060 0.822 1.151 0.056 0.009 0.11 0.013 0.165 0.013 0.48 0 0.904-0.347 0.985-0.835z")

View File

@ -3,23 +3,15 @@
include _includes/_mixins
- title = IS_MODELS ? LANGUAGES[current.source] || title : title
- social_title = (SECTION == "index") ? SITENAME + " - " + SLOGAN : title + " - " + SITENAME
- social_img = SITE_URL + "/assets/img/social/preview_" + (preview || ALPHA ? "alpha" : "default") + ".jpg"
- PAGE_URL = getPageUrl()
- PAGE_TITLE = getPageTitle()
- PAGE_IMAGE = getPageImage()
doctype html
html(lang="en")
head
title
if SECTION == "api" || SECTION == "usage" || SECTION == "models"
- var title_section = (SECTION == "api") ? "API" : SECTION.charAt(0).toUpperCase() + SECTION.slice(1)
| #{title} | #{SITENAME} #{title_section} Documentation
else if SECTION != "index"
| #{title} | #{SITENAME}
else
| #{SITENAME} - #{SLOGAN}
title=PAGE_TITLE
meta(charset="utf-8")
meta(name="viewport" content="width=device-width, initial-scale=1.0")
meta(name="referrer" content="always")
@ -27,23 +19,24 @@ html(lang="en")
meta(property="og:type" content="website")
meta(property="og:site_name" content=sitename)
meta(property="og:url" content="#{SITE_URL}/#{current.path.join('/')}")
meta(property="og:title" content=social_title)
meta(property="og:url" content=PAGE_URL)
meta(property="og:title" content=PAGE_TITLE)
meta(property="og:description" content=description)
meta(property="og:image" content=social_img)
meta(property="og:image" content=PAGE_IMAGE)
meta(name="twitter:card" content="summary_large_image")
meta(name="twitter:site" content="@" + SOCIAL.twitter)
meta(name="twitter:title" content=social_title)
meta(name="twitter:title" content=PAGE_TITLE)
meta(name="twitter:description" content=description)
meta(name="twitter:image" content=social_img)
meta(name="twitter:image" content=PAGE_IMAGE)
link(rel="shortcut icon" href="/assets/img/favicon.ico")
link(rel="icon" type="image/x-icon" href="/assets/img/favicon.ico")
if SECTION == "api"
link(href="/assets/css/style_green.css?v#{V_CSS}" rel="stylesheet")
else if SECTION == "universe"
link(href="/assets/css/style_purple.css?v#{V_CSS}" rel="stylesheet")
else
link(href="/assets/css/style.css?v#{V_CSS}" rel="stylesheet")
@ -54,6 +47,9 @@ html(lang="en")
if !landing
include _includes/_page-docs
else if SECTION == "universe"
!=yield
else
main!=yield
include _includes/_footer

View File

@ -1,5 +1,13 @@
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
p
| spaCy's statistical models have been custom-designed to give a
| high-performance mix of speed and accuracy. The current architecture
| hasn't been published yet, but in the meantime we prepared a video that
| explains how the models work, with particular focus on NER.
+youtube("sqDHBH9IjRU")
p
| The parsing model is a blend of recent results. The two main
| inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg at
@ -44,7 +52,7 @@ p
+cell First two words of the buffer.
+row
+cell.u-nowrap
+cell
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
| #[code B1L1]#[br]
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
@ -54,7 +62,7 @@ p
| #[code S2], #[code B0] and #[code B1].
+row
+cell.u-nowrap
+cell
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
| #[code B1R1]#[br]
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],

View File

@ -60,6 +60,13 @@
padding-bottom: 4rem
border-bottom: 1px dotted $color-subtle
&.o-section--small
overflow: auto
&:not(:last-child)
margin-bottom: 3.5rem
padding-bottom: 2rem
.o-block
margin-bottom: 4rem
@ -142,6 +149,14 @@
.o-badge
border-radius: 1em
.o-thumb
@include size(100px)
overflow: hidden
border-radius: 50%
&.o-thumb--small
@include size(35px)
//- SVG

View File

@ -103,6 +103,9 @@
&:hover
color: $color-theme-dark
.u-hand
cursor: pointer
.u-hide-link.u-hide-link
border: none
color: inherit
@ -224,6 +227,7 @@
$spinner-size: 75px
$spinner-bar: 8px
min-height: $spinner-size * 2
position: relative
& > *
@ -245,10 +249,19 @@
//- Hidden elements
.u-hidden
display: none
.u-hidden,
[v-cloak]
display: none !important
@each $breakpoint in (xs, sm, md)
.u-hidden-#{$breakpoint}.u-hidden-#{$breakpoint}
@include breakpoint(max, $breakpoint)
display: none
//- Transitions
.u-fade-enter-active
transition: opacity 0.5s
.u-fade-enter
opacity: 0

View File

@ -2,7 +2,8 @@
//- Code block
.c-code-block
.c-code-block,
.thebelab-cell
background: $color-front
color: darken($color-back, 20)
padding: 0.75em 0
@ -13,7 +14,7 @@
white-space: pre
direction: ltr
&.c-code-block--has-icon
.c-code-block--has-icon
padding: 0
display: flex
border-top-left-radius: 0
@ -28,26 +29,66 @@
&.c-code-block__icon--border
border-left: 6px solid
//- Code block content
.c-code-block__content
.c-code-block__content,
.thebelab-input,
.jp-OutputArea
display: block
font: normal normal 1.1rem/#{1.9} $font-code
padding: 1em 2em
&[data-prompt]:before,
.c-code-block__content[data-prompt]:before,
content: attr(data-prompt)
margin-right: 0.65em
display: inline-block
vertical-align: middle
opacity: 0.5
//- Thebelab
[data-executable]
margin-bottom: 0
.thebelab-input.thebelab-input
padding: 3em 2em 1em
.jp-OutputArea
&:not(:empty)
padding: 2rem 2rem 1rem
border-top: 1px solid $color-dark
margin-top: 2rem
.entities, svg
white-space: initial
font-family: inherit
.entities
font-size: 1.35rem
.jp-OutputArea pre
font: inherit
.jp-OutputPrompt.jp-OutputArea-prompt
padding-top: 0.5em
margin-right: 1rem
font-family: inherit
font-weight: bold
.thebelab-run-button
@extend .u-text-label, .u-text-label--dark
.thebelab-wrapper
position: relative
.thebelab-wrapper__text
@include position(absolute, top, right, 1.25rem, 1.25rem)
color: $color-subtle-dark
z-index: 10
//- Code
code
code, .CodeMirror, .jp-RenderedText, .jp-OutputArea
-webkit-font-smoothing: subpixel-antialiased
-moz-osx-font-smoothing: auto
@ -73,7 +114,7 @@ code
text-shadow: none
//- Syntax Highlighting
//- Syntax Highlighting (Prism)
[class*="language-"] .token
&.comment, &.prolog, &.doctype, &.cdata, &.punctuation
@ -103,3 +144,50 @@ code
&.italic
font-style: italic
//- Syntax Highlighting (CodeMirror)
.CodeMirror.cm-s-default
background: $color-front
color: darken($color-back, 20)
.CodeMirror-selected
background: $color-theme
color: $color-back
.CodeMirror-cursor
border-left-color: currentColor
.cm-variable-2
color: inherit
font-style: italic
.cm-comment
color: map-get($syntax-highlighting, comment)
.cm-keyword, .cm-builtin
color: map-get($syntax-highlighting, keyword)
.cm-operator
color: map-get($syntax-highlighting, operator)
.cm-string
color: map-get($syntax-highlighting, selector)
.cm-number
color: map-get($syntax-highlighting, number)
.cm-def
color: map-get($syntax-highlighting, function)
//- Syntax highlighting (Jupyter)
.jp-RenderedText pre
.ansi-cyan-fg
color: map-get($syntax-highlighting, function)
.ansi-green-fg
color: $color-green
.ansi-red-fg
color: map-get($syntax-highlighting, operator)
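
The .entities and svg rules above appear to target displaCy markup rendered inside the executable cells' output area. A minimal sketch of a cell that would produce such output, assuming displaCy's Jupyter rendering in spaCy v2:

    import spacy
    from spacy import displacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

    # renders entity HTML (wrapped in a div.entities) and dependency SVG
    # inline in the notebook/Thebelab output area
    displacy.render(doc, style='ent', jupyter=True)
    displacy.render(doc, style='dep', jupyter=True)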

View File

@ -8,6 +8,12 @@
width: 100%
position: relative
&.x-terminal--small
background: $color-dark
color: $color-subtle
border-radius: 4px
margin-bottom: 4rem
.x-terminal__icons
position: absolute
padding: 10px
@ -32,6 +38,13 @@
content: ""
background: $color-yellow
&.x-terminal__icons--small
&:before,
&:after,
span
@include size(10px)
.x-terminal__code
margin: 0
border: none

View File

@ -9,7 +9,7 @@
display: flex
justify-content: space-between
flex-flow: row nowrap
padding: 0 2rem 0 1rem
padding: 0 0 0 1rem
z-index: 30
width: 100%
box-shadow: $box-shadow
@ -21,11 +21,20 @@
.c-nav__menu
@include size(100%)
display: flex
justify-content: flex-end
flex-flow: row nowrap
border-color: inherit
flex: 1
@include breakpoint(max, sm)
@include scroll-shadow-base($color-front)
overflow-x: auto
overflow-y: hidden
-webkit-overflow-scrolling: touch
@include breakpoint(min, md)
justify-content: flex-end
.c-nav__menu__item
display: flex
align-items: center
@ -39,6 +48,14 @@
&:not(:first-child)
margin-left: 2em
&:last-child
@include scroll-shadow-cover(right, $color-back)
padding-right: 2rem
&:first-child
@include scroll-shadow-cover(left, $color-back)
padding-left: 2rem
&.is-active
color: $color-dark
pointer-events: none

View File

@ -26,7 +26,7 @@ $font-code: Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace
// Colors
$colors: ( blue: #09a3d5, green: #05b083 )
$colors: ( blue: #09a3d5, green: #05b083, purple: #6542d1 )
$color-back: #fff !default
$color-front: #1a1e23 !default

View File

@ -0,0 +1,4 @@
//- 💫 STYLESHEET (PURPLE)
$theme: purple
@import style

Binary file not shown (added, 204 KiB).

Binary file not shown (added, 144 KiB).

Binary file not shown (added, 118 KiB).

View File

@ -1,25 +0,0 @@
'use strict';
import { $$ } from './util.js';
export default class Accordion {
/**
* Simple, collapsible accordion sections.
* Inspired by: https://inclusive-components.design/collapsible-sections/
* @param {string} selector - Query selector of button element.
*/
constructor(selector) {
[...$$(selector)].forEach(btn =>
btn.addEventListener('click', this.onClick.bind(this)))
}
/**
* Toggle aria-expanded attribute on button and visibility of section.
* @param {node} Event.target - The accordion button.
*/
onClick({ target }) {
const exp = target.getAttribute('aria-expanded') === 'true' || false;
target.setAttribute('aria-expanded', !exp);
target.parentElement.nextElementSibling.hidden = exp;
}
}

View File

@ -1,72 +0,0 @@
'use strict';
import { Templater, handleResponse } from './util.js';
export default class Changelog {
/**
* Fetch and render changelog from GitHub. Clones a template node (table row)
* to avoid doubling templating markup in JavaScript.
* @param {string} user - GitHub username.
* @param {string} repo - Repository to fetch releases from.
*/
constructor(user, repo) {
this.url = `https://api.github.com/repos/${user}/${repo}/releases`;
this.template = new Templater('changelog');
this.fetchChangelog()
.then(json => this.render(json))
.catch(this.showError.bind(this));
// make sure scroll positions for progress bar etc. are recalculated
window.dispatchEvent(new Event('resize'));
}
fetchChangelog() {
return new Promise((resolve, reject) =>
fetch(this.url)
.then(res => handleResponse(res))
.then(json => json.ok ? resolve(json) : reject()))
}
showError() {
this.template.get('error').style.display = 'block';
}
/**
* Get template section from template row. Hacky, but does make sense.
* @param {node} item - Parent element.
* @param {string} id - ID of child element, set via data-changelog.
*/
getField(item, id) {
return item.querySelector(`[data-changelog="${id}"]`);
}
render(json) {
this.template.get('table').style.display = 'block';
this.row = this.template.get('item');
this.releases = this.template.get('releases');
this.prereleases = this.template.get('prereleases');
Object.values(json)
.filter(release => release.name)
.forEach(release => this.renderRelease(release));
this.row.remove();
}
/**
* Clone the template row and populate with content from API response.
* https://developer.github.com/v3/repos/releases/#list-releases-for-a-repository
* @param {string} name - Release title.
* @param {string} tag (tag_name) - Release tag.
* @param {string} url (html_url) - URL to the release page on GitHub.
* @param {string} date (published_at) - Timestamp of release publication.
* @param {boolean} prerelease - Whether the release is a prerelease.
*/
renderRelease({ name, tag_name: tag, html_url: url, published_at: date, prerelease }) {
const container = prerelease ? this.prereleases : this.releases;
const tagLink = `<a href="${url}" target="_blank"><code>${tag}</code></a>`;
const title = (name.split(': ').length == 2) ? name.split(': ')[1] : name;
const row = this.row.cloneNode(true);
this.getField(row, 'date').textContent = date.split('T')[0];
this.getField(row, 'tag').innerHTML = tagLink;
this.getField(row, 'title').textContent = title;
container.appendChild(row);
}
}

View File

@ -0,0 +1,40 @@
/**
* Initialise changelog table for releases and prereleases
* @param {string} selector - The element selector to initialise the app.
* @param {string} repo - Repository to load from, in the format user/repo.
*/
export default function(selector, repo) {
new Vue({
el: selector,
data: {
url: `https://api.github.com/repos/${repo}/releases`,
releases: [],
prereleases: [],
error: false
},
beforeMount() {
fetch(this.url)
.then(res => res.json())
.then(json => this.$_update(json))
.catch(err => { this.error = true });
},
updated() {
window.dispatchEvent(new Event('resize')); // scroll position for progress
},
methods: {
$_update(json) {
const allReleases = Object.values(json)
.filter(release => release.name)
.map(release => ({
title: (release.name.split(': ').length == 2) ? release.name.split(': ')[1] : release.name,
url: release.html_url,
date: release.published_at.split('T')[0],
tag: release.tag_name,
pre: release.prerelease
}));
this.releases = allReleases.filter(release => !release.pre);
this.prereleases = allReleases.filter(release => release.pre);
}
}
});
}

View File

@ -1,42 +0,0 @@
'use strict';
import { $$ } from './util.js';
export default class GitHubEmbed {
/**
* Embed code from GitHub repositories, similar to Gist embeds. Fetches the
* raw text and places it inside the element.
* Usage: <pre><code data-gh-embed="spacy/master/examples/x.py"></code></pre>
* @param {string} user - GitHub user or organization.
* @param {string} attr - Data attribute used to select containers. Attribute
* value should be path to file relative to user.
*/
constructor(user, attr) {
this.url = `https://raw.githubusercontent.com/${user}`;
this.attr = attr;
[...$$(`[${this.attr}]`)].forEach(el => this.embed(el));
}
/**
* Fetch code from GitHub and insert it as element content. File path is
* read off the container's data attribute.
* @param {node} el - The element.
*/
embed(el) {
el.parentElement.setAttribute('data-loading', '');
fetch(`${this.url}/${el.getAttribute(this.attr)}`)
.then(res => res.text().then(text => ({ text, ok: res.ok })))
.then(({ text, ok }) => ok ? this.render(el, text) : false)
el.parentElement.removeAttribute('data-loading');
}
/**
* Add text to container and apply syntax highlighting via Prism, if available.
* @param {node} el - The element.
* @param {string} text - The raw code, fetched from GitHub.
*/
render(el, text) {
el.textContent = text;
if (window.Prism) Prism.highlightElement(el);
}
}

website/assets/js/main.js (new file, 148 additions)
View File

@ -0,0 +1,148 @@
/**
* Initialise changelog
*/
import initChangelog from './changelog.vue.js';
{
const selector = '[data-vue="changelog"]';
if (window.Vue && document.querySelector(selector)) {
initChangelog(selector, 'explosion/spacy');
}
}
/**
* Initialise models
*/
import initModels from './models.vue.js';
{
if (window.Vue && document.querySelector('[data-model]')) {
initModels('explosion/spacy-models')
}
}
/**
* Initialise Universe
*/
import initUniverse from './universe.vue.js';
{
const selector = '[data-vue="universe"]';
if (window.Vue && document.querySelector(selector)) {
initUniverse(selector, '/universe/universe.json');
}
}
/**
* Initialise Quickstart
*/
if (document.querySelector('#qs') && window.Quickstart) {
new Quickstart('#qs');
}
/**
* Thebelabs
*/
if (window.thebelab) {
window.thebelab.on('status', (ev, data) => {
if (data.status == 'failed') {
const msg = "Failed to connect to kernel :( This can happen if too many users are active at the same time. Please reload the page and try again!";
const wrapper = `<span style="white-space: pre-wrap">${msg}</span>`;
document.querySelector('.jp-OutputArea-output pre').innerHTML = wrapper;
}
});
}
/**
* Highlight section in viewport in sidebar, using in-view library
*/
{
const sectionAttr = 'data-section';
const navAttr = 'data-nav';
const activeClass = 'is-active';
const sections = [...document.querySelectorAll(`[${navAttr}]`)];
if (window.inView) {
if (sections.length) { // highlight first item regardless
sections[0].classList.add(activeClass);
}
inView(`[${sectionAttr}]`).on('enter', section => {
const id = section.getAttribute(sectionAttr);
const el = document.querySelector(`[${navAttr}="${id}"]`);
if (el) {
sections.forEach(el => el.classList.remove(activeClass));
el.classList.add(activeClass);
}
});
}
}
/**
* Simple, collapsible accordion sections.
* Inspired by: https://inclusive-components.design/collapsible-sections/
*/
{
const elements = [...document.querySelectorAll('.js-accordion')];
elements.forEach(el => el.addEventListener('click', ({ target }) => {
const exp = target.getAttribute('aria-expanded') === 'true' || false;
target.setAttribute('aria-expanded', !exp);
target.parentElement.nextElementSibling.hidden = exp;
}));
}
/**
* Reading indicator as progress bar
* @param {string} selector - Selector of <progress> element.
*/
class ProgressBar {
constructor(selector) {
this.scrollY = 0;
this.sizes = this.updateSizes();
this.el = document.querySelector(selector);
this.el.setAttribute('max', 100);
window.addEventListener('scroll', ev => {
this.scrollY = (window.pageYOffset || document.scrollTop) - (document.clientTop || 0);
requestAnimationFrame(this.update.bind(this));
});
window.addEventListener('resize', ev => {
this.sizes = this.updateSizes();
requestAnimationFrame(this.update.bind(this));
});
}
update() {
const offset = 100 - ((this.sizes.height - this.scrollY - this.sizes.vh) / this.sizes.height * 100);
this.el.setAttribute('value', (this.scrollY == 0) ? 0 : offset || 0);
}
updateSizes() {
return {
height: Math.max(document.body.scrollHeight, document.body.offsetHeight, document.documentElement.clientHeight, document.documentElement.scrollHeight, document.documentElement.offsetHeight),
vh: Math.max(document.documentElement.clientHeight, window.innerHeight || 0)
}
}
}
new ProgressBar('.js-progress');
/**
* Embed code from GitHub repositories, similar to Gist embeds. Fetches the
* raw text and places it inside the element.
* Usage: <pre><code data-gh-embed="spacy/master/examples/x.py"></code></pre>
*/
{
const attr = 'data-gh-embed';
const url = 'https://raw.githubusercontent.com/explosion';
const elements = [...document.querySelectorAll(`[${attr}]`)];
elements.forEach(el => {
el.parentElement.setAttribute('data-loading', '');
fetch(`${url}/${el.getAttribute(attr)}`)
.then(res => res.text().then(text => ({ text, ok: res.ok })))
.then(({ text, ok }) => {
if (ok) {
el.textContent = text;
if (window.Prism) Prism.highlightElement(el);
}
el.parentElement.removeAttribute('data-loading');
})
});
}

View File

@ -1,332 +0,0 @@
'use strict';
import { Templater, handleResponse, convertNumber, abbrNumber } from './util.js';
/**
* Chart.js defaults
*/
const CHART_COLORS = { model1: '#09a3d5', model2: '#066B8C' };
const CHART_FONTS = {
legend: '-apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"',
ticks: 'Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace'
};
/**
* Formatters for model details.
* @property {function} author - Format model author with optional link.
* @property {function} license - Format model license with optional link.
* @property {function} sources - Format training data sources (list or string).
* @property {function} pipeline - Format list of pipeline components.
* @property {function} vectors - Format vector data (entries and dimensions).
* @property {function} version - Format model version number.
*/
const formats = {
author: (author, url) => url ? `<a href="${url}" target="_blank">${author}</a>` : author,
license: (license, url) => url ? `<a href="${url}" target="_blank">${license}</a>` : license,
sources: sources => (sources instanceof Array) ? sources.join(', ') : sources,
pipeline: pipes => (pipes && pipes.length) ? pipes.map(p => `<code>${p}</code>`).join(', ') : '-',
vectors: vec => formatVectors(vec),
version: version => `<code>v${version}</code>`
};
/**
* Format word vectors data depending on contents.
* @property {Object} data - The vectors object from the model's meta.json.
*/
const formatVectors = data => {
if (!data) return 'n/a';
if (Object.values(data).every(n => n == 0)) return 'context vectors only';
const { keys, vectors: vecs, width } = data;
return `${abbrNumber(keys)} keys, ${abbrNumber(vecs)} unique vectors (${width} dimensions)`;
}
/**
* Find the latest version of a model in a compatibility table.
* @param {string} model - The model name.
* @param {Object} compat - Compatibility table, keyed by spaCy version.
*/
const getLatestVersion = (model, compat = {}) => {
for (let [spacy_v, models] of Object.entries(compat)) {
if (models[model]) return models[model][0];
}
};
export class ModelLoader {
/**
* Load model meta from GitHub and update model details on site. Uses the
* Templater mini template engine to update DOM.
* @param {string} repo - Path to GitHub repository containing releases.
* @param {Array} models - List of model IDs, e.g. "en_core_web_sm".
* @param {Object} licenses - License IDs mapped to URLs.
* @param {Object} benchmarkKeys - Objects of available keys by type, e.g.
* 'parser', 'ner', 'speed', mapped to labels.
*/
constructor(repo, models = [], licenses = {}, benchmarkKeys = {}) {
this.url = `https://raw.githubusercontent.com/${repo}/master`;
this.repo = `https://github.com/${repo}`;
this.modelIds = models;
this.licenses = licenses;
this.benchKeys = benchmarkKeys;
this.init();
}
init() {
this.modelIds.forEach(modelId =>
new Templater(modelId).get('table').setAttribute('data-loading', ''));
this.fetch(`${this.url}/compatibility.json`)
.then(json => this.getModels(json.spacy))
.catch(_ => this.modelIds.forEach(modelId => this.showError(modelId)));
// make sure scroll positions for progress bar etc. are recalculated
window.dispatchEvent(new Event('resize'));
}
fetch(url) {
return new Promise((resolve, reject) =>
fetch(url).then(res => handleResponse(res))
.then(json => json.ok ? resolve(json) : reject()))
}
getModels(compat) {
this.compat = compat;
for (let modelId of this.modelIds) {
const version = getLatestVersion(modelId, compat);
if (version) this.fetch(`${this.url}/meta/${modelId}-${version}.json`)
.then(json => this.render(json))
.catch(_ => this.showError(modelId))
else this.showError(modelId);
}
}
showError(modelId) {
const tpl = new Templater(modelId);
tpl.get('table').removeAttribute('data-loading');
tpl.get('error').hidden = false;
for (let key of ['sources', 'pipeline', 'vecs', 'author', 'license']) {
tpl.get(key).parentElement.parentElement.hidden = true;
}
}
/**
* Update model details in tables. Currently quite hacky :(
*/
render(data) {
const modelId = `${data.lang}_${data.name}`;
const model = `${modelId}-${data.version}`;
const tpl = new Templater(modelId);
this.renderDetails(tpl, data)
this.renderBenchmarks(tpl, data.accuracy, data.speed);
this.renderCompat(tpl, modelId);
tpl.get('download').setAttribute('href', `${this.repo}/releases/tag/${model}`);
tpl.get('table').removeAttribute('data-loading');
tpl.get('error').hidden = true;
}
renderDetails(tpl, { version, size, description, notes, author, url,
license, sources, vectors, pipeline }) {
const basics = { version, size, description, notes }
for (let [key, value] of Object.entries(basics)) {
if (value) tpl.fill(key, value);
}
if (author) tpl.fill('author', formats.author(author, url), true);
if (license) tpl.fill('license', formats.license(license, this.licenses[license]), true);
if (sources) tpl.fill('sources', formats.sources(sources));
if (vectors) tpl.fill('vecs', formats.vectors(vectors));
else tpl.get('vecs').parentElement.parentElement.hidden = true;
if (pipeline && pipeline.length) tpl.fill('pipeline', formats.pipeline(pipeline), true);
else tpl.get('pipeline').parentElement.parentElement.hidden = true;
}
renderBenchmarks(tpl, accuracy = {}, speed = {}) {
if (!accuracy && !speed) return;
this.renderTable(tpl, 'parser', accuracy, val => val.toFixed(2));
this.renderTable(tpl, 'ner', accuracy, val => val.toFixed(2));
tpl.get('benchmarks').hidden = false;
}
renderTable(tpl, id, benchmarks, converter = val => val) {
if (!this.benchKeys[id] || !Object.keys(this.benchKeys[id]).some(key => benchmarks[key])) return;
for (let key of Object.keys(this.benchKeys[id])) {
if (benchmarks[key]) tpl
.fill(key, convertNumber(converter(benchmarks[key])))
.parentElement.hidden = false;
}
tpl.get(id).hidden = false;
}
renderCompat(tpl, modelId) {
tpl.get('compat-wrapper').hidden = false;
const header = '<option selected disabled>spaCy version</option>';
const options = Object.keys(this.compat)
.map(v => `<option value="${v}">v${v}</option>`)
.join('');
tpl
.fill('compat', header + options, true)
.addEventListener('change', ({ target: { value }}) =>
tpl.fill('compat-versions', this.getCompat(value, modelId), true))
}
getCompat(version, model) {
const res = this.compat[version][model];
return res ? `<code>${model}-${res[0]}</code>` : '<em>not compatible</em>';
}
}
export class ModelComparer {
/**
* Compare two model meta files and render chart and comparison table.
* @param {string} repo - Path to GitHub repository containing releases.
* @param {Object} licenses - License IDs mapped to URLs.
* @param {Object} benchmarkKeys - Objects of available keys by type, e.g.
* 'parser', 'ner', 'speed', mapped to labels.
* @param {Object} languages - Available languages, ID mapped to name.
* @param {Object} defaultModels - Models to compare on load, 'model1' and
* 'model2' mapped to model names.
*/
constructor(repo, licenses = {}, benchmarkKeys = {}, languages = {}, labels = {}, defaultModels) {
this.url = `https://raw.githubusercontent.com/${repo}/master`;
this.repo = `https://github.com/${repo}`;
this.tpl = new Templater('compare');
this.benchKeys = benchmarkKeys;
this.licenses = licenses;
this.languages = languages;
this.labels = labels;
this.models = {};
this.colors = CHART_COLORS;
this.fonts = CHART_FONTS;
this.defaultModels = defaultModels;
this.tpl.get('result').hidden = false;
this.tpl.get('error').hidden = true;
this.fetchCompat()
.then(compat => this.init(compat))
.catch(this.showError.bind(this))
}
init(compat) {
this.compat = compat;
const selectA = this.tpl.get('model1');
const selectB = this.tpl.get('model2');
selectA.addEventListener('change', this.onSelect.bind(this));
selectB.addEventListener('change', this.onSelect.bind(this));
this.chart = new Chart('chart_compare_accuracy', { type: 'bar', options: {
responsive: true,
legend: { position: 'bottom', labels: { fontFamily: this.fonts.legend, fontSize: 13 }},
scales: {
yAxes: [{ label: 'Accuracy', ticks: { min: 70, fontFamily: this.fonts.ticks }}],
xAxes: [{ barPercentage: 0.75, ticks: { fontFamily: this.fonts.ticks }}]
}
}});
if (this.defaultModels) {
selectA.value = this.defaultModels.model1;
selectB.value = this.defaultModels.model2;
this.getModels(this.defaultModels);
}
}
fetchCompat() {
return new Promise((resolve, reject) =>
fetch(`${this.url}/compatibility.json`)
.then(res => handleResponse(res))
.then(json => json.ok ? resolve(json.spacy) : reject()))
}
fetchModel(name) {
const version = getLatestVersion(name, this.compat);
const modelName = `${name}-${version}`;
return new Promise((resolve, reject) => {
if (!version) reject();
// resolve immediately if model already loaded, e.g. in this.models
else if (this.models[name]) resolve(this.models[name]);
else fetch(`${this.url}/meta/${modelName}.json`)
.then(res => handleResponse(res))
.then(json => json.ok ? resolve(this.saveModel(name, json)) : reject())
})
}
/**
* "Save" meta to this.models so it only has to be fetched from GitHub once.
* @param {string} name - The model name.
* @param {Object} data - The model meta data.
*/
saveModel(name, data) {
this.models[name] = data;
return data;
}
showError(err) {
console.error(err || 'Error');
this.tpl.get('result').hidden = true;
this.tpl.get('error').hidden = false;
}
onSelect(ev) {
const modelId = ev.target.value;
const otherId = (ev.target.id == 'model1') ? 'model2' : 'model1';
const otherVal = this.tpl.get(otherId);
const otherModel = otherVal.options[otherVal.selectedIndex].value;
if (otherModel != '') this.getModels({
[ev.target.id]: modelId,
[otherId]: otherModel
})
}
getModels({ model1, model2 }) {
this.tpl.get('result').setAttribute('data-loading', '');
this.fetchModel(model1)
.then(data1 => this.fetchModel(model2)
.then(data2 => this.render({ model1: data1, model2: data2 })))
.catch(this.showError.bind(this))
}
/**
* Render two models, and populate the chart and table. Currently quite hacky :(
* @param {Object} models - The models to render.
* @param {Object} models.model1 - The first model (via first <select>).
* @param {Object} models.model2 - The second model (via second <select>).
*/
render({ model1, model2 }) {
const accKeys = Object.assign({}, this.benchKeys.parser, this.benchKeys.ner);
const allKeys = [...Object.keys(model1.accuracy || []), ...Object.keys(model2.accuracy || [])];
const metaKeys = Object.keys(accKeys).filter(k => allKeys.includes(k));
const labels = metaKeys.map(key => accKeys[key]);
const datasets = [model1, model2]
.map(({ lang, name, version, accuracy = {} }, i) => ({
label: `${lang}_${name}-${version}`,
backgroundColor: this.colors[`model${i + 1}`],
data: metaKeys.map(key => (accuracy[key] || 0).toFixed(2))
}));
this.chart.data = { labels, datasets };
this.chart.update();
[model1, model2].forEach((model, i) => this.renderTable(metaKeys, i + 1, model));
this.tpl.get('result').removeAttribute('data-loading');
this.tpl.get('error').hidden = true;
this.tpl.get('result').hidden = false;
}
renderTable(metaKeys, i, { lang, name, version, size, description,
notes, author, url, license, sources, vectors, pipeline, accuracy = {},
speed = {}}) {
const type = name.split('_')[0]; // extract type from model name
const genre = name.split('_')[1]; // extract genre from model name
this.tpl.fill(`table-head${i}`, `${lang}_${name}`);
this.tpl.get(`link${i}`).setAttribute('href', `/models/${lang}#${lang}_${name}`);
this.tpl.fill(`download${i}`, `python -m spacy download ${lang}_${name}\n`);
this.tpl.fill(`lang${i}`, this.languages[lang] || lang);
this.tpl.fill(`type${i}`, this.labels[type] || type);
this.tpl.fill(`genre${i}`, this.labels[genre] || genre);
this.tpl.fill(`version${i}`, formats.version(version), true);
this.tpl.fill(`size${i}`, size);
this.tpl.fill(`desc${i}`, description || 'n/a');
this.tpl.fill(`pipeline${i}`, formats.pipeline(pipeline), true);
this.tpl.fill(`vecs${i}`, formats.vectors(vectors));
this.tpl.fill(`sources${i}`, formats.sources(sources));
this.tpl.fill(`author${i}`, formats.author(author, url), true);
this.tpl.fill(`license${i}`, formats.license(license, this.licenses[license]), true);
// check if model accuracy or speed includes one of the pre-set keys
const allKeys = [].concat(...Object.entries(this.benchKeys).map(([_, v]) => Object.keys(v)));
for (let key of allKeys) {
if (accuracy[key]) this.tpl.fill(`${key}${i}`, accuracy[key].toFixed(2))
else this.tpl.fill(`${key}${i}`, 'n/a')
}
}
}

View File

@ -0,0 +1,138 @@
/**
* Initialise model overviews
* @param {string} repo - Repository to load from, in the format user/repo.
*/
export default function(repo) {
const LICENSES = {
'CC BY 4.0': 'https://creativecommons.org/licenses/by/4.0/',
'CC BY-SA': 'https://creativecommons.org/licenses/by-sa/3.0/',
'CC BY-SA 3.0': 'https://creativecommons.org/licenses/by-sa/3.0/',
'CC BY-SA 4.0': 'https://creativecommons.org/licenses/by-sa/4.0/',
'CC BY-NC': 'https://creativecommons.org/licenses/by-nc/3.0/',
'CC BY-NC 3.0': 'https://creativecommons.org/licenses/by-nc/3.0/',
'CC-BY-NC-SA 3.0': 'https://creativecommons.org/licenses/by-nc-sa/3.0/',
'GPL': 'https://www.gnu.org/licenses/gpl.html',
'LGPL': 'https://www.gnu.org/licenses/lgpl.html',
'MIT': 'https://opensource.org/licenses/MIT'
};
const URL = `https://raw.githubusercontent.com/${repo}/master`;
const models = [...document.querySelectorAll('[data-vue]')]
.map(el => el.getAttribute('data-vue'));
document.addEventListener('DOMContentLoaded', ev => {
fetch(`${URL}/compatibility.json`)
.then(res => res.json())
.then(json => models.forEach(modelId => new Vue({
el: `[data-vue="${modelId}"]`,
data: {
repo: `https://github.com/${repo}`,
compat: json.spacy,
loading: false,
error: false,
id: modelId,
version: 'n/a',
notes: null,
sizeFull: null,
pipeline: null,
releaseUrl: null,
description: null,
license: null,
author: null,
url: null,
vectors: null,
sources: null,
uas: null,
las: null,
tags_acc: null,
ents_f: null,
ents_p: null,
ents_r: null,
modelLicenses: LICENSES,
spacyVersion: Object.keys(json.spacy)[0]
},
computed: {
compatVersion() {
const res = this.compat[this.spacyVersion][this.id];
return res ? `${this.id}-${res[0]}` : false;
},
orderedCompat() {
return Object.keys(this.compat)
.filter(v => !v.includes('a') && !v.includes('dev') && !v.includes('rc'));
},
hasAccuracy() {
return this.uas || this.las || this.tags_acc || this.ents_f || this.ents_p || this.ents_r;
}
},
beforeMount() {
const version = this.$_getLatestVersion(this.id);
if (version) {
this.loading = true;
fetch(`${URL}/meta/${this.id}-${version}.json`)
.then(res => res.json())
.then(json => this.$_updateData(json))
.catch(err => { this.error = true });
}
},
updated() {
window.dispatchEvent(new Event('resize')); // scroll position for progress
},
methods: {
$_updateData(data) {
const fullName = `${data.lang}_${data.name}-${data.version}`;
this.version = data.version;
this.releaseUrl = `${this.repo}/releases/tag/${fullName}`;
this.sizeFull = data.size;
this.pipeline = data.pipeline;
this.notes = data.notes;
this.description = data.description;
this.vectors = this.$_formatVectors(data.vectors);
this.sources = data.sources;
this.author = data.author;
this.url = data.url;
this.license = data.license;
const accuracy = data.accuracy || {};
for (let key of Object.keys(accuracy)) {
this[key] = accuracy[key].toFixed(2);
}
this.loading = false;
},
$_getLatestVersion(modelId) {
for (let [spacy_v, models] of Object.entries(this.compat)) {
if (models[modelId]) {
return models[modelId][0];
}
}
},
$_formatVectors(data) {
if (!data) {
return 'n/a';
}
if (Object.values(data).every(n => n == 0)) {
return 'context vectors only';
}
const { keys, vectors, width } = data;
const nKeys = this.$_abbrNum(keys);
const nVectors = this.$_abbrNum(vectors);
return `${nKeys} keys, ${nVectors} unique vectors (${width} dimensions)`;
},
/**
* Abbreviate a number, e.g. 14249930 --> 14.25m.
* @param {number|string} num - The number to convert.
* @param {number} fixed - Number of decimals.
*/
$_abbrNum: function(num = 0, fixed = 1) {
const suffixes = ['', 'k', 'm', 'b', 't'];
if (num === null || num === 0) return 0;
const b = num.toPrecision(2).split('e');
const k = (b.length === 1) ? 0 : Math.floor(Math.min(b[1].slice(1), 14) / 3);
const n = (k < 1) ? num : num / Math.pow(10, k * 3);
const c = (k >= 1 && n >= 100 ) ? Math.round(n) : n.toFixed(fixed);
return (c < 0 ? c : Math.abs(c)) + suffixes[k];
}
}
})))
});
}


@ -1,35 +0,0 @@
'use strict';
import { $, $$ } from './util.js';
export default class NavHighlighter {
/**
* Highlight the section currently in the viewport in the sidebar, using the in-view library.
* @param {string} sectionAttr - Data attribute of sections.
* @param {string} navAttr - Data attribute of navigation items.
* @param {string} activeClass Class name of active element.
*/
constructor(sectionAttr, navAttr, activeClass = 'is-active') {
this.sections = [...$$(`[${navAttr}]`)];
// highlight first item regardless
if (this.sections.length) this.sections[0].classList.add(activeClass);
this.navAttr = navAttr;
this.sectionAttr = sectionAttr;
this.activeClass = activeClass;
if (window.inView) inView(`[${sectionAttr}]`)
.on('enter', this.highlightSection.bind(this));
}
/**
* Check if section in view exists in sidebar and mark as active.
* @param {node} section - The section in view.
*/
highlightSection(section) {
const id = section.getAttribute(this.sectionAttr);
const el = $(`[${this.navAttr}="${id}"]`);
if (el) {
this.sections.forEach(el => el.classList.remove(this.activeClass));
el.classList.add(this.activeClass);
}
}
}


@ -1,52 +0,0 @@
'use strict';
import { $ } from './util.js';
export default class ProgressBar {
/**
* Animated reading progress bar.
* @param {string} selector CSS selector of progress bar element.
*/
constructor(selector) {
this.scrollY = 0;
this.sizes = this.updateSizes();
this.el = $(selector);
this.el.setAttribute('max', 100);
window.addEventListener('scroll', this.onScroll.bind(this));
window.addEventListener('resize', this.onResize.bind(this));
}
onScroll(ev) {
this.scrollY = (window.pageYOffset || document.scrollTop) - (document.clientTop || 0);
requestAnimationFrame(this.update.bind(this));
}
onResize(ev) {
this.sizes = this.updateSizes();
requestAnimationFrame(this.update.bind(this));
}
update() {
const offset = 100 - ((this.sizes.height - this.scrollY - this.sizes.vh) / this.sizes.height * 100);
this.el.setAttribute('value', (this.scrollY == 0) ? 0 : offset || 0);
}
/**
* Update scroll and viewport height. Called on load and window resize.
*/
updateSizes() {
return {
height: Math.max(
document.body.scrollHeight,
document.body.offsetHeight,
document.documentElement.clientHeight,
document.documentElement.scrollHeight,
document.documentElement.offsetHeight
),
vh: Math.max(
document.documentElement.clientHeight,
window.innerHeight || 0
)
}
}
}


@ -1,25 +0,0 @@
/**
* This file is bundled by Rollup, compiled with Babel and included as
* <script nomodule> for older browsers that don't yet support JavaScript
* modules. Browsers that do will ignore this bundle and won't even fetch it
* from the server. Details:
* https://github.com/rollup/rollup
* https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
*/
// Import all modules that are instantiated directly in _includes/_scripts.jade
import ProgressBar from './progress.js';
import NavHighlighter from './nav-highlighter.js';
import Changelog from './changelog.js';
import GitHubEmbed from './github-embed.js';
import Accordion from './accordion.js';
import { ModelLoader, ModelComparer } from './models.js';
// Assign to window so they are bundled by rollup
window.ProgressBar = ProgressBar;
window.NavHighlighter = NavHighlighter;
window.Changelog = Changelog;
window.GitHubEmbed = GitHubEmbed;
window.Accordion = Accordion;
window.ModelLoader = ModelLoader;
window.ModelComparer = ModelComparer;


@ -0,0 +1,124 @@
export default function(selector, dataPath) {
Vue.use(VueMarkdown);
new Vue({
el: selector,
data: {
filteredResources: [],
allResources: [],
projectCats: {},
educationCats: {},
filterVals: ['category'],
activeMenu: 'all',
selected: null,
loading: false
},
computed: {
resources() {
return this.filteredResources.sort((a, b) => a.id.localeCompare(b.id));
},
categories() {
return Object.assign({}, this.projectCats, this.educationCats);
}
},
beforeMount() {
this.loading = true;
window.addEventListener('popstate', this.$_init);
fetch(dataPath)
.then(res => res.json())
.then(({ resources, projectCats, educationCats }) => {
this.allResources = resources || [];
this.filteredResources = resources || [];
this.projectCats = projectCats || {};
this.educationCats = educationCats || {};
this.$_init();
this.loading = false;
});
},
updated() {
if (window.Prism) Prism.highlightAll();
// make sure scroll positions for progress bar etc. are recalculated
window.dispatchEvent(new Event('resize'));
},
methods: {
getAuthorLink(id, link) {
if (id == 'twitter') return `https://twitter.com/${link}`;
else if (id == 'github') return `https://github.com/${link}`;
return link;
},
filterBy(id, selector = 'category') {
window.scrollTo(0, 0);
if (!this.filterVals.includes(selector)) {
return;
}
const resources = this.$_filterResources(id, selector);
if (!resources.length) return;
this.selected = null;
this.activeMenu = id;
this.filteredResources = resources;
},
viewResource(id) {
const res = this.allResources.find(r => r.id == id);
if (!res) return;
this.selected = res;
this.activeMenu = null;
if (this.$_getQueryVar('id') != res.id) {
this.$_updateUrl({ id: res.id });
}
window.scrollTo(0, 0);
},
$_filterResources(id, selector) {
if (id == 'all') {
if (window.location.search != '') {
this.$_updateUrl({});
}
return this.allResources;
}
const resources = this.allResources
.filter(res => (res[selector] || []).includes(id));
if (resources.length && this.$_getQueryVar(selector) != id) {
this.$_updateUrl({ [selector]: id });
}
return resources;
},
$_init() {
const current = this.$_getQueryVar('id');
if (current) {
this.viewResource(current);
return;
}
for (let filterVal of this.filterVals) {
const queryVar = this.$_getQueryVar(filterVal);
if (queryVar) {
this.filterBy(queryVar, filterVal);
return;
}
}
this.filterBy('all');
},
$_getQueryVar(key) {
const query = window.location.search.substring(1);
const params = query.split('&').map(param => param.split('='));
for(let param of params) {
if (param[0] == key) {
return decodeURIComponent(param[1]);
}
}
return false;
},
$_updateUrl(params) {
const loc = Object.keys(params)
.map(param => `${param}=${encodeURIComponent(params[param])}`);
const url = loc.length ? '?' + loc.join('&') : window.location.origin + window.location.pathname;
window.history.pushState(params, null, url);
}
}
})
}


@ -1,70 +0,0 @@
'use strict';
export const $ = document.querySelector.bind(document);
export const $$ = document.querySelectorAll.bind(document);
export class Templater {
/**
* Mini templating engine based on data attributes. Selects elements based
* on a data-tpl and data-tpl-key attribute and can set textContent
* and innerHTML.
* @param {string} templateId - Template section, e.g. value of data-tpl.
*/
constructor(templateId) {
this.templateId = templateId;
}
/**
* Get an element from the template and return it.
* @param {string} key - Name of the key within the current template.
*/
get(key) {
return $(`[data-tpl="${this.templateId}"][data-tpl-key="${key}"]`);
}
/**
* Fill the content of a template element with a value.
* @param {string} key - Name of the key within the current template.
* @param {string} value - Content to insert into template element.
* @param {boolean} html - Insert content as HTML. Defaults to false.
*/
fill(key, value, html = false) {
const el = this.get(key);
if (html) el.innerHTML = value || '';
else el.textContent = value || '';
return el;
}
}
/**
* Handle API response and assign status to returned JSON.
* @param {Response} res The response.
*/
export const handleResponse = res => {
if (res.ok) return res.json()
.then(json => Object.assign({}, json, { ok: res.ok }))
else return ({ ok: res.ok })
};
/**
* Convert a number to a string and add thousand separator.
* @param {number|string} num - The number to convert.
* @param {string} separator Thousand separator.
*/
export const convertNumber = (num = 0, separator = ',') =>
num.toString().replace(/\B(?=(\d{3})+(?!\d))/g, separator);
/**
* Abbreviate a number, e.g. 14249930 --> 14.25m.
* @param {number|string} num - The number to convert.
* @param {number} fixed - Number of decimals.
*/
export const abbrNumber = (num = 0, fixed = 1) => {
const suffixes = ['', 'k', 'm', 'b', 't'];
if (num === null || num === 0) return 0;
const b = num.toPrecision(2).split('e');
const k = (b.length === 1) ? 0 : Math.floor(Math.min(b[1].slice(1), 14) / 3);
const n = (k < 1) ? num : num / Math.pow(10, k * 3);
const c = (k >= 1 && n >= 100 ) ? Math.round(n) : n.toFixed(fixed);
return (c < 0 ? c : Math.abs(c)) + suffixes[k];
}

File diff suppressed because one or more lines are too long


@ -23,3 +23,6 @@ Prism.languages.scss=Prism.languages.extend("css",{comment:{pattern:/(^|[^\\])(?
Prism.languages.sql={comment:{pattern:/(^|[^\\])(?:\/\*[\w\W]*?\*\/|(?:--|\/\/|#).*)/,lookbehind:!0},string:{pattern:/(^|[^@\\])("|')(?:\\?[\s\S])*?\2/,lookbehind:!0},variable:/@[\w.$]+|@("|'|`)(?:\\?[\s\S])+?\1/,"function":/\b(?:COUNT|SUM|AVG|MIN|MAX|FIRST|LAST|UCASE|LCASE|MID|LEN|ROUND|NOW|FORMAT)(?=\s*\()/i,keyword:/\b(?:ACTION|ADD|AFTER|ALGORITHM|ALL|ALTER|ANALYZE|ANY|APPLY|AS|ASC|AUTHORIZATION|BACKUP|BDB|BEGIN|BERKELEYDB|BIGINT|BINARY|BIT|BLOB|BOOL|BOOLEAN|BREAK|BROWSE|BTREE|BULK|BY|CALL|CASCADED?|CASE|CHAIN|CHAR VARYING|CHARACTER (?:SET|VARYING)|CHARSET|CHECK|CHECKPOINT|CLOSE|CLUSTERED|COALESCE|COLLATE|COLUMN|COLUMNS|COMMENT|COMMIT|COMMITTED|COMPUTE|CONNECT|CONSISTENT|CONSTRAINT|CONTAINS|CONTAINSTABLE|CONTINUE|CONVERT|CREATE|CROSS|CURRENT(?:_DATE|_TIME|_TIMESTAMP|_USER)?|CURSOR|DATA(?:BASES?)?|DATETIME|DBCC|DEALLOCATE|DEC|DECIMAL|DECLARE|DEFAULT|DEFINER|DELAYED|DELETE|DENY|DESC|DESCRIBE|DETERMINISTIC|DISABLE|DISCARD|DISK|DISTINCT|DISTINCTROW|DISTRIBUTED|DO|DOUBLE(?: PRECISION)?|DROP|DUMMY|DUMP(?:FILE)?|DUPLICATE KEY|ELSE|ENABLE|ENCLOSED BY|END|ENGINE|ENUM|ERRLVL|ERRORS|ESCAPE(?:D BY)?|EXCEPT|EXEC(?:UTE)?|EXISTS|EXIT|EXPLAIN|EXTENDED|FETCH|FIELDS|FILE|FILLFACTOR|FIRST|FIXED|FLOAT|FOLLOWING|FOR(?: EACH ROW)?|FORCE|FOREIGN|FREETEXT(?:TABLE)?|FROM|FULL|FUNCTION|GEOMETRY(?:COLLECTION)?|GLOBAL|GOTO|GRANT|GROUP|HANDLER|HASH|HAVING|HOLDLOCK|IDENTITY(?:_INSERT|COL)?|IF|IGNORE|IMPORT|INDEX|INFILE|INNER|INNODB|INOUT|INSERT|INT|INTEGER|INTERSECT|INTO|INVOKER|ISOLATION LEVEL|JOIN|KEYS?|KILL|LANGUAGE SQL|LAST|LEFT|LIMIT|LINENO|LINES|LINESTRING|LOAD|LOCAL|LOCK|LONG(?:BLOB|TEXT)|MATCH(?:ED)?|MEDIUM(?:BLOB|INT|TEXT)|MERGE|MIDDLEINT|MODIFIES SQL DATA|MODIFY|MULTI(?:LINESTRING|POINT|POLYGON)|NATIONAL(?: CHAR VARYING| CHARACTER(?: VARYING)?| VARCHAR)?|NATURAL|NCHAR(?: VARCHAR)?|NEXT|NO(?: SQL|CHECK|CYCLE)?|NONCLUSTERED|NULLIF|NUMERIC|OFF?|OFFSETS?|ON|OPEN(?:DATASOURCE|QUERY|ROWSET)?|OPTIMIZE|OPTION(?:ALLY)?|ORDER|OUT(?:ER|FILE)?|OVER|PARTIAL|PARTITION|PERCENT|PIVOT|PLAN|POINT|POLYGON|PRECEDING|PRECISION|PREV|PRIMARY|PRINT|PRIVILEGES|PROC(?:EDURE)?|PUBLIC|PURGE|QUICK|RAISERROR|READ(?:S SQL DATA|TEXT)?|REAL|RECONFIGURE|REFERENCES|RELEASE|RENAME|REPEATABLE|REPLICATION|REQUIRE|RESTORE|RESTRICT|RETURNS?|REVOKE|RIGHT|ROLLBACK|ROUTINE|ROW(?:COUNT|GUIDCOL|S)?|RTREE|RULE|SAVE(?:POINT)?|SCHEMA|SELECT|SERIAL(?:IZABLE)?|SESSION(?:_USER)?|SET(?:USER)?|SHARE MODE|SHOW|SHUTDOWN|SIMPLE|SMALLINT|SNAPSHOT|SOME|SONAME|START(?:ING BY)?|STATISTICS|STATUS|STRIPED|SYSTEM_USER|TABLES?|TABLESPACE|TEMP(?:ORARY|TABLE)?|TERMINATED BY|TEXT(?:SIZE)?|THEN|TIMESTAMP|TINY(?:BLOB|INT|TEXT)|TOP?|TRAN(?:SACTIONS?)?|TRIGGER|TRUNCATE|TSEQUAL|TYPES?|UNBOUNDED|UNCOMMITTED|UNDEFINED|UNION|UNIQUE|UNPIVOT|UPDATE(?:TEXT)?|USAGE|USE|USER|USING|VALUES?|VAR(?:BINARY|CHAR|CHARACTER|YING)|VIEW|WAITFOR|WARNINGS|WHEN|WHERE|WHILE|WITH(?: ROLLUP|IN)?|WORK|WRITE(?:TEXT)?)\b/i,"boolean":/\b(?:TRUE|FALSE|NULL)\b/i,number:/\b-?(?:0x)?\d*\.?[\da-f]+\b/,operator:/[-+*\/=%^~]|&&?|\|?\||!=?|<(?:=>?|<|>)?|>[>=]?|\b(?:AND|BETWEEN|IN|LIKE|NOT|OR|IS|DIV|REGEXP|RLIKE|SOUNDS LIKE|XOR)\b/i,punctuation:/[;[\]()`,.]/};
Prism.languages.wiki=Prism.languages.extend("markup",{"block-comment":{pattern:/(^|[^\\])\/\*[\w\W]*?\*\//,lookbehind:!0,alias:"comment"},heading:{pattern:/^(=+).+?\1/m,inside:{punctuation:/^=+|=+$/,important:/.+/}},emphasis:{pattern:/('{2,5}).+?\1/,inside:{"bold italic":{pattern:/(''''').+?(?=\1)/,lookbehind:!0},bold:{pattern:/(''')[^'](?:.*?[^'])?(?=\1)/,lookbehind:!0},italic:{pattern:/('')[^'](?:.*?[^'])?(?=\1)/,lookbehind:!0},punctuation:/^''+|''+$/}},hr:{pattern:/^-{4,}/m,alias:"punctuation"},url:[/ISBN +(?:97[89][ -]?)?(?:\d[ -]?){9}[\dx]\b|(?:RFC|PMID) +\d+/i,/\[\[.+?\]\]|\[.+?\]/],variable:[/__[A-Z]+__/,/\{{3}.+?\}{3}/,/\{\{.+?}}/],symbol:[/^#redirect/im,/~{3,5}/],"table-tag":{pattern:/((?:^|[|!])[|!])[^|\r\n]+\|(?!\|)/m,lookbehind:!0,inside:{"table-bar":{pattern:/\|$/,alias:"punctuation"},rest:Prism.languages.markup.tag.inside}},punctuation:/^(?:\{\||\|\}|\|-|[*#:;!|])|\|\||!!/m}),Prism.languages.insertBefore("wiki","tag",{nowiki:{pattern:/<(nowiki|pre|source)\b[\w\W]*?>[\w\W]*?<\/\1>/i,inside:{tag:{pattern:/<(?:nowiki|pre|source)\b[\w\W]*?>|<\/(?:nowiki|pre|source)>/i,inside:Prism.languages.markup.tag.inside}}}});
Prism.languages.yaml={scalar:{pattern:/([\-:]\s*(![^\s]+)?[ \t]*[|>])[ \t]*(?:((?:\r?\n|\r)[ \t]+)[^\r\n]+(?:\3[^\r\n]+)*)/,lookbehind:!0,alias:"string"},comment:/#.*/,key:{pattern:/(\s*[:\-,[{\r\n?][ \t]*(![^\s]+)?[ \t]*)[^\r\n{[\]},#]+?(?=\s*:\s)/,lookbehind:!0,alias:"atrule"},directive:{pattern:/(^[ \t]*)%.+/m,lookbehind:!0,alias:"important"},datetime:{pattern:/([:\-,[{]\s*(![^\s]+)?[ \t]*)(\d{4}-\d\d?-\d\d?([tT]|[ \t]+)\d\d?:\d{2}:\d{2}(\.\d*)?[ \t]*(Z|[-+]\d\d?(:\d{2})?)?|\d{4}-\d{2}-\d{2}|\d\d?:\d{2}(:\d{2}(\.\d*)?)?)(?=[ \t]*($|,|]|}))/m,lookbehind:!0,alias:"number"},"boolean":{pattern:/([:\-,[{]\s*(![^\s]+)?[ \t]*)(true|false)[ \t]*(?=$|,|]|})/im,lookbehind:!0,alias:"important"},"null":{pattern:/([:\-,[{]\s*(![^\s]+)?[ \t]*)(null|~)[ \t]*(?=$|,|]|})/im,lookbehind:!0,alias:"important"},string:{pattern:/([:\-,[{]\s*(![^\s]+)?[ \t]*)("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')(?=[ \t]*($|,|]|}))/m,lookbehind:!0},number:{pattern:/([:\-,[{]\s*(![^\s]+)?[ \t]*)[+\-]?(0x[\da-f]+|0o[0-7]+|(\d+\.?\d*|\.?\d+)(e[\+\-]?\d+)?|\.inf|\.nan)[ \t]*(?=$|,|]|})/im,lookbehind:!0},tag:/![^\s]+/,important:/[&*][\w]+/,punctuation:/---|[:[\]{}\-,|>?]|\.\.\./};
Prism.languages.julia={comment:{pattern:/(^|[^\\])#.*/,lookbehind:!0},string:/("""|''')[\s\S]+?\1|("|')(?:\\.|(?!\2)[^\\\r\n])*\2/,keyword:/\b(?:abstract|baremodule|begin|bitstype|break|catch|ccall|const|continue|do|else|elseif|end|export|finally|for|function|global|if|immutable|import|importall|let|local|macro|module|print|println|quote|return|try|type|typealias|using|while)\b/,"boolean":/\b(?:true|false)\b/,number:/(?:\b(?=\d)|\B(?=\.))(?:0[box])?(?:[\da-f]+\.?\d*|\.\d+)(?:[efp][+-]?\d+)?j?/i,operator:/[-+*^%÷&$\\]=?|\/[\/=]?|!=?=?|\|[=>]?|<(?:<=?|[=:])?|>(?:=|>>?=?)?|==?=?|[~≠≤≥]/,punctuation:/[{}[\];(),.:]/};
Prism.languages.r={comment:/#.*/,string:{pattern:/(['"])(?:\\.|(?!\1)[^\\\r\n])*\1/,greedy:!0},"percent-operator":{pattern:/%[^%\s]*%/,alias:"operator"},"boolean":/\b(?:TRUE|FALSE)\b/,ellipsis:/\.\.(?:\.|\d+)/,number:[/\b(?:NaN|Inf)\b/,/(?:\b0x[\dA-Fa-f]+(?:\.\d*)?|\b\d+\.?\d*|\B\.\d+)(?:[EePp][+-]?\d+)?[iL]?/],keyword:/\b(?:if|else|repeat|while|function|for|in|next|break|NULL|NA|NA_integer_|NA_real_|NA_complex_|NA_character_)\b/,operator:/->?>?|<(?:=|<?-)?|[>=!]=?|::?|&&?|\|\|?|[+*\/^$@~]/,punctuation:/[(){}\[\],;]/};
Prism.languages.docker={keyword:{pattern:/(^\s*)(?:ADD|ARG|CMD|COPY|ENTRYPOINT|ENV|EXPOSE|FROM|HEALTHCHECK|LABEL|MAINTAINER|ONBUILD|RUN|SHELL|STOPSIGNAL|USER|VOLUME|WORKDIR)(?=\s)/im,lookbehind:!0},string:/("|')(?:(?!\1)[^\\\r\n]|\\(?:\r\n|[\s\S]))*\1/,comment:/#.*/,punctuation:/---|\.\.\.|[:[\]{}\-,|>?]/},Prism.languages.dockerfile=Prism.languages.docker;

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

website/assets/js/vendor/vue.min.js vendored Normal file

File diff suppressed because one or more lines are too long


@ -53,15 +53,23 @@ include _includes/_mixins
.o-content
+grid
+grid-col("two-thirds")
+terminal("lightning_tour.py", "More examples", "/usage/spacy-101#lightning-tour").
# Install: pip install spacy && python -m spacy download en
+code-exec("Edit the code & try spaCy", true).
# pip install spacy
# python -m spacy download en_core_web_sm
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
# Process whole documents
text = open('customer_feedback_627.txt').read()
text = (u"When Sebastian Thrun started working on self-driving cars at "
u"Google in 2007, few people outside of the company took him "
u"seriously. “I can tell you very senior CEOs of major American "
u"car companies would shake my hand and turn away because I wasn’t "
u"worth talking to,” said Thrun, now the co-founder and CEO of "
u"online higher education startup Udacity, in an interview with "
u"Recode earlier this week.")
doc = nlp(text)
# Find named entities, phrases and concepts
@ -69,12 +77,10 @@ include _includes/_mixins
print(entity.text, entity.label_)
# Determine semantic similarities
doc1 = nlp(u'the fries were gross')
doc2 = nlp(u'worst fries ever')
doc1.similarity(doc2)
# Hook in your own deep learning models
nlp.add_pipe(load_my_model(), before='parser')
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)
+grid-col("third")
+h(2) Features


@ -1,8 +1,7 @@
{
"sidebar": {
"Models": {
"Overview": "./",
"Comparison": "comparison"
"Overview": "./"
},
"Language models": {
@ -72,7 +71,8 @@
"vecs": "Word vectors included in the model. Models that only support context vectors compute similarity via the tensors shared with the pipeline.",
"benchmark_parser": "Syntax accuracy",
"benchmark_ner": "NER accuracy",
"benchmark_speed": "Speed"
"benchmark_speed": "Speed",
"compat": "Latest compatible model version for your spaCy installation"
},
"MODEL_LICENSES": {
@ -92,6 +92,11 @@
"ner": { "ents_f": "NER F", "ents_p": "NER P", "ents_r": "NER R" }
},
"EXAMPLE_SENT_LANGS": [
"da", "de", "en", "es", "fa", "fr", "he", "hi", "hu", "id", "it", "ja",
"nb", "nl", "pl", "pt", "ru", "sv", "tr", "zh"
],
"LANGUAGES": {
"en": "English",
"de": "German",


@ -1,83 +0,0 @@
//- 💫 DOCS > MODELS > COMPARISON
include ../_includes/_mixins
p
| This experimental tool helps you compare spaCy's statistical models
| by features, accuracy and speed. This can be especially useful to get an
| idea of the trade-offs between larger and smaller models of the same
| type. For example, #[code lg] models tend to be more accurate than
| the corresponding #[code sm] versions but they're often significantly
| larger in file size and memory usage.
- TPL = "compare"
+grid.o-box
for i in [1, 2]
+grid-col("half", "no-gutter")
label.u-heading.u-text-label.u-text-center.u-color-theme(for="model#{i}") Model #{i}
.o-field.o-grid.o-grid--vcenter.u-padding-small
select.o-field__select.u-text-small(id="model#{i}" data-tpl=TPL data-tpl-key="model#{i}")
option(selected="" disabled="" value="") Select model...
for models, _ in MODELS
for model in models
option(value=model)=model
div(data-tpl=TPL data-tpl-key="error")
+infobox
| Unable to load model details and accuracy figures from GitHub to
| compare the models. For details of the individual models, see the
| overview of the
| #[+a(gh("spacy-models") + "/releases") latest model releases].
div(data-tpl=TPL data-tpl-key="result" hidden="")
+chart("compare_accuracy", 350)
+aside-code("Download", "text")
for i in [1, 2]
span(data-tpl=TPL data-tpl-key="download#{i}")
+table.o-block-small(data-tpl=TPL data-tpl-key="table")
+row("head")
+head-cell
for i in [1, 2]
+head-cell(style="width: 40%")
a(data-tpl=TPL data-tpl-key="link#{i}")
code(data-tpl=TPL data-tpl-key="table-head#{i}" style="text-transform: initial; font-weight: normal")
for label, id in {lang: "Language", type: "Type", genre: "Genre"}
+row
+cell #[+label=label]
for i in [1, 2]
+cell(data-tpl=TPL data-tpl-key="#{id}#{i}") n/a
for label in ["Version", "Size", "Pipeline", "Vectors", "Sources", "Author", "License"]
- var field = label.toLowerCase()
if field == "vectors"
- field = "vecs"
+row
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
for i in [1, 2]
+cell
span(data-tpl=TPL data-tpl-key=field + i) #[em n/a]
+row
+cell #[+label Description]
for i in [1, 2]
+cell.u-text-tiny(data-tpl=TPL data-tpl-key="desc#{i}") n/a
for benchmark, _ in MODEL_BENCHMARKS
- var counter = 0
for label, field in benchmark
+row((counter == 0) ? "divider" : null)
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
for i in [1, 2]
+cell
span(data-tpl=TPL data-tpl-key=field + i) n/a
- counter++


@ -16,12 +16,12 @@
"scripts": {
"check_links": "blc https://spacy.io -ro",
"rollup": "rollup www/assets/js/rollup.js --output.format iife --output.file www/assets/js/rollup.js",
"babel": "babel www/assets/js/rollup.js --out-file www/assets/js/rollup.js --presets=es2015",
"uglify": "uglifyjs www/assets/js/rollup.js --output www/assets/js/rollup.js",
"rollup": "rollup www/assets/js/main.js --output.format iife --output.file www/assets/js/main.js",
"babel": "babel www/assets/js/main.js --out-file www/assets/js/main.js --presets=es2015",
"uglify": "uglifyjs www/assets/js/main.js --output www/assets/js/main.js",
"compile": "NODE_ENV=deploy harp compile",
"bundle": "npm run rollup && npm run babel && npm run uglify",
"deploy": "rsync -P --compress --recursive --checksum --delete --exclude=\".*\" --exclude=\"README.html\" --exclude \"package.json\" --exclude \"www\" www/ $1"
"deploy": "rsync -P --compress --recursive --checksum --delete --exclude=\".*\" --exclude=\"*.vue.js\" --exclude=\"README.html\" --exclude \"package.json\" --exclude \"www\" www/ $1"
}
}


@ -368,29 +368,6 @@ include _includes/_mixins
+accordion("Section " + i)
p Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque enim ante, pretium a orci eget, varius dignissim augue. Nam eu dictum mauris, id tincidunt nisi. Integer commodo pellentesque tincidunt. Nam at turpis finibus tortor gravida sodales tincidunt sit amet est. Nullam euismod arcu in tortor auctor, sit amet dignissim justo congue. Sed varius vel dolor id dapibus. Curabitur pharetra fermentum arcu nec suscipit. Duis aliquet eros lorem, vitae interdum magna tempus quis.
+h(3, "chart") Chart
p
| Charts are powered by #[+a("http://www.chartjs.org") chart.js] and
| implemented via a mixin that creates the #[code canvas] element and
| assigns the chart ID. The chart data itself is supplied in JavaScript.
| Charts are mostly used to visualise and compare model accuracy scores
| and speed benchmarks.
+aside-code("Usage", "jade").
+chart("accuracy")
script(src="/assets/js/chart.min.js")
script new Chart('chart_accuracy', { datasets: [] })
+chart("accuracy", 400)
+chart("speed", 300)
script(src="/assets/js/vendor/chart.min.js")
script.
Chart.defaults.global.defaultFontFamily = "-apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'";
new Chart('chart_accuracy', { type: 'bar', options: { legend: { position: 'bottom'}, responsive: true, scales: { yAxes: [{ label: 'Accuracy', ticks: { suggestedMin: 70 } }], xAxes: [{ barPercentage: 0.75 }]}}, data: { labels: ['UAS', 'LAS', 'POS', 'NER F', 'NER P', 'NER R'], datasets: [{ label: 'en_core_web_sm', data: [91.65, 89.77, 97.05, 84.80, 84.53, 85.06], backgroundColor: '#09a3d5' }, { label: 'en_core_web_lg', data: [91.49, 89.66, 97.23, 86.46, 86.78, 86.15], backgroundColor: '#066B8C'}]}});
new Chart('chart_speed', { type: 'horizontalBar', options: { legend: { position: 'bottom'}, responsive: true, scales: { xAxes: [{ label: 'Speed', ticks: { suggestedMin: 0 }}], yAxes: [{ barPercentage: 0.75 }]}}, data: { labels: ['w/s CPU', 'w/s GPU'], datasets: [{ label: 'en_core_web_sm', data: [9575, 25531], backgroundColor: '#09a3d5'}, { label: 'en_core_web_lg', data: [8421, 22092], backgroundColor: '#066B8C'}]}});
+section("embeds")
+h(2, "embeds") Embeds
@ -560,15 +537,15 @@ include _includes/_mixins
| the sub-navigation in the sidebar and maps titles to section IDs.
+code(false, "json").
"resources": {
"title": "Resources",
"teaser": "Libraries, demos, books, courses and research systems featuring spaCy.",
"v2": {
"title": "What's New in v2.0",
"teaser": "New features, backwards incompatibilities and migration guide.",
"menu": {
"Third-party libraries": "libraries",
"Demos & Visualizations": "demos",
"Books & Courses": "books",
"Jupyter Notebooks": "notebooks",
"Research": "research"
"Summary": "summary",
"New features": "features",
"Backwards Incompatibilities": "incompat",
"Migrating from v1.x": "migrating",
"Benchmarks": "benchmarks"
}
}


@ -0,0 +1,95 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy Universe
The [spaCy Universe](https://spacy.io/universe) collects the many great resources developed with or for spaCy. It
includes standalone packages, plugins, extensions, educational materials,
operational utilities and bindings for other languages.
If you have a project that you want the spaCy community to make use of, you can
suggest it by submitting a pull request to this repository. The Universe
database is open-source and collected in a simple JSON file.
Looking for inspiration for your own spaCy plugin or extension? Check out the
[`project idea`](https://github.com/explosion/spaCy/labels/project%20idea) label
on the issue tracker.
## Checklist
### Projects
✅ Libraries and packages should be **open-source** (with a user-friendly license) and at least somewhat **documented** (e.g. a simple `README` with usage instructions).
✅ We're happy to include work in progress and prereleases, but we'd like to keep the emphasis on projects that should be useful to the community **right away**.
✅ Demos and visualizers should be available via a **public URL**.
### Educational Materials
✅ Books should be **available for purchase or download** (not just pre-order). Ebooks and self-published books are fine, too, if they include enough substantial content.
✅ The `"url"` of book entries should either point to the publisher's website or a reseller of your choice (ideally one that ships worldwide or as close as possible).
✅ If an online course is only available behind a paywall, it should at least have a **free excerpt** or chapter available, so users know what to expect.
## JSON format
To add a project, fork this repository, edit the [`universe.json`](universe.json)
and add an object of the following format to the list of `"resources"`. Before
you submit your pull request, make sure to use a linter to verify that your
markup is correct. We'll also be adding linting for the `universe.json` to our
automated GitHub checks soon.
```json
{
    "id": "unique-project-id",
    "title": "Project title",
    "slogan": "A short summary",
    "description": "A longer description *Markdown allowed!*",
"github": "user/repo",
"pip": "package-name",
"code_example": [
"import spacy",
"import package_name",
"",
"nlp = spacy.load('en')",
"nlp.add_pipe(package_name)"
],
"code_language": "python",
"url": "https://example.com",
"thumb": "https://example.com/thumb.jpg",
"image": "https://example.com/image.jpg",
"author": "Your Name",
"author_links": {
"twitter": "username",
"github": "username",
"website": "https://example.com"
},
"category": ["pipeline", "standalone"],
"tags": ["some-tag", "etc"]
}
```
| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Unique ID of the project. |
| `title` | string | Project title. If not set, the `id` will be used as the display title. |
| `slogan` | string | A short description of the project. Displayed in the overview and under the title. |
| `description` | string | A longer description of the project. Markdown is allowed, but should be limited to basic formatting like bold, italics, code or links. |
| `github` | string | Associated GitHub repo in the format `user/repo`. Will be displayed as a link and used for release, license and star badges. |
| `pip` | string | Package name on pip. If available, the installation command will be displayed. |
| `cran` | string | For R packages: package name on CRAN. If available, the installation command will be displayed. |
| `code_example` | array | Short example that shows how to use the project. Formatted as an array with one string per line. |
| `code_language` | string | Defaults to `'python'`. Optional code language used for syntax highlighting with [Prism](http://prismjs.com/). |
| `url` | string | Optional project link to display as button. |
| `thumb` | string | Optional URL to project thumbnail to display in overview and project header. Recommended size is 100x100px. |
| `image` | string | Optional URL to project image to display with description. |
| `author` | string | Name(s) of project author(s). |
| `author_links` | object | Usernames and links to display as icons to author info. Currently supports `twitter` and `github` usernames, as well as `website` link. |
| `category` | list | One or more categories to assign to project. Must be one of the available options. |
| `tags` | list | Still experimental and not used for filtering: one or more tags to assign to project. |
To separate them from the projects, educational materials also specify
`"type": "education"`. Books can also set a `"cover"` field containing a URL
to a cover image. If available, it's used in the overview and displayed on
the individual book page.
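For example, a minimal book entry might look something like the sketch below. All of the values are placeholders, and the `"books"` category is only assumed here for illustration; check the existing entries in [`universe.json`](universe.json) for the categories that are actually in use.

```json
{
    "type": "education",
    "id": "example-spacy-book",
    "title": "An Example spaCy Book",
    "slogan": "A placeholder book entry",
    "cover": "https://example.com/cover.jpg",
    "url": "https://example.com/book",
    "author": "Your Name",
    "category": ["books"]
}
```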


@ -0,0 +1,9 @@
{
"index": {
"title": "Universe",
"teaser": "This section collects the many great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities and bindings for other languages.",
"description": "This section collects the many great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities and bindings for other languages.",
"landing": true,
"preview": "universe"
}
}

website/universe/index.jade Normal file

@ -0,0 +1,146 @@
//- 💫 DOCS > UNIVERSE
include ../_includes/_mixins
mixin sidebar-section(title)
ul.c-sidebar__section.o-block-small
if title
li.u-text-label.u-color-dark=title
block
section(data-vue="universe")
menu.c-sidebar.js-sidebar.u-text
+sidebar-section("Overview")
li.c-sidebar__item
a.u-hand(v-on:click="filterBy('all')" v-bind:class="{'is-active': activeMenu == 'all'}") All Projects
+sidebar-section("Projects")
li.c-sidebar__item(v-for="(data, id) in projectCats")
a.u-hand(v-on:click="filterBy(id, 'category')" v-text="data.title" v-bind:class="{ 'is-active': activeMenu == id }")
+sidebar-section("Education")
li.c-sidebar__item(v-for="(data, id) in educationCats")
a.u-hand(v-on:click="filterBy(id, 'category')" v-text="data.title" v-bind:class="{ 'is-active': activeMenu == id }")
main.o-main.o-main--sidebar.o-main--aside
article.o-content
transition-group(name="u-fade")
section(v-if="selected" key="selected" v-cloak="")
+h(1).u-heading--title
.u-float-right.o-thumb(v-if="selected.thumb")
img(v-bind:src="selected.thumb" width="100" role="presentation")
| {{ selected.title || selected.id }}
.u-heading__teaser.u-text-small.u-color-dark.o-block-small(v-if="selected.slogan") {{ selected.slogan }}
p(v-if="selected.github")
a.u-hide-link(v-bind:href="`https://github.com/${selected.github}`")
| #[img.o-badge(v-bind:src="`https://img.shields.io/github/release/${selected.github}/all.svg?style=flat-square`")]
| #[img.o-badge(v-bind:src="`https://img.shields.io/github/license/${selected.github}.svg?style=flat-square`")]
| #[img(v-bind:src="`https://img.shields.io/github/stars/${selected.github}.svg?style=social&label=Stars`")]
div(v-if="selected.pip")
+aside-code("Installation", "bash", "$").
pip install {{ selected.pip }}
div(v-else-if="selected.cran")
+aside-code("Installation", "r").
install.packages("{{ selected.cran }}")
+section.o-section--small
img.o-block-small.u-padding-medium.u-float-right(v-if="selected.cover" v-bind:src="selected.cover" v-bind:alt="selected.title" width="250" style="max-width: 50%")
.x-markdown.o-block(v-if="selected.description")
vue-markdown(v-bind:source="selected.description")
.o-block(v-if="selected.code_example")
+code("Example", "none")(v-bind:class="`lang-${selected.code_language||'#{DEFAULT_SYNTAX}'}`")
| {{ selected.code_example.join('\n') }}
figure.o-block.u-text(v-if="selected.image")
img(v-bind:src="selected.image" width="800" v-bind:alt="selected.slogan || selected.title || selected.id")
p(v-if="selected.url")
+button("", false, "primary", "small")(target="_blank" v-bind:href="selected.url") View more
+grid
+grid-col("half")(v-if="selected.author")
+label Author info
p.o-inline-list
span {{ selected.author }}
span.u-color-subtle-dark(v-if="selected.author_links") &nbsp;
span(v-for="id in ['github', 'twitter', 'website']" v-if="selected.author_links[id]")
a.u-hide-link(rel="noopener nofollow" v-bind:href="getAuthorLink(id, selected.author_links[id])" v-bind:aria-label="id")
svg.o-icon(aria-hidden="true" viewBox="0 0 18 18" width="18" height="18")
use(v-bind:xlink:href="`#svg_${id}`")
| &nbsp;
+grid-col("half")(v-if="selected.github")
+label GitHub
p.o-no-block
span.u-inline-block.u-nowrap
+a("", false)(target="_blank" v-bind:href="`https://github.com/${selected.github}`")
code.u-break.u-break--all(v-text="selected.github")
| #[+icon("code", 16).o-icon--inline.u-color-theme]
+grid-col("full")(v-if="selected.category")
+label Categories
p.o-no-block
span(v-for="cat in selected.category" v-if="categories[cat]")
a.u-text.u-hand(v-on:click="filterBy(cat, 'category')")
code(v-text="cat")
| &nbsp;
section(v-else="" key="overview")
+h(1).u-heading--title
span(v-if="activeMenu && categories[activeMenu]" v-cloak="")
| {{ categories[activeMenu].title }}
+tag {{ resources.length }}
.u-heading__teaser.u-text-small.u-color-dark(v-if="categories[activeMenu].description" v-text="categories[activeMenu].description")
span(v-else)=title
.u-heading__teaser.u-text-small.u-color-dark=teaser
+section().o-section--small
+infobox()(v-if="false")
| Couldn't load the projects overview. This may
| happen if there's a bug in our code, or if you
| have JavaScript disabled. The resources list
| displayed on this page is open-source and
| available on GitHub; see
| #[+src(gh("spacy", "website/universe/universe.json")) #[code universe.json]]
| for the full data.
+grid()(v-cloak="" v-bind:data-loading="loading")
+grid-col().u-text(v-for="resource in resources" v-bind:key="resource.id" v-bind:class="{'o-box': !resource.cover, 'o-grid__col--third': resource.cover, 'o-grid__col--half': !resource.cover}" v-if="(activeMenu && activeMenu != 'all') || resource.type != 'education'")
a.u-hand(v-on:click="viewResource(resource.id)")
img(v-bind:src="resource.cover" v-bind:alt="resource.title" v-if="resource.cover")
div(v-else)
+h(5).o-block-small
.o-thumb.o-thumb--small.u-float-right(v-if="resource.thumb")
img(v-bind:src="resource.thumb" width="35" role="presentation")
span {{ resource.title || resource.id }}
.u-text-small.o-no-block(v-if="resource.slogan" v-text="resource.slogan")
+section().o-section--small
+h(3) Submit your project
p
| If you have a project that you want the spaCy
| community to make use of, you can suggest it by
| submitting a pull request to the spaCy website
| repository. The Universe database is open-source
| and collected in a simple JSON file. For more
| details on the formats and available fields, see
| the documentation. Looking for inspiration for your
| own spaCy plugin or extension? Check out the
| #[+a(gh("spacy") + "/labels/project%20idea") #[code project idea]]
| label on the issue tracker.
p.o-inline-list
+button(gh("spacy", "website/universe/README.md"), false, "small", "primary") Read the docs
+button(gh("spacy", "website/universe/universe.json"), false, "small", "secondary") JSON source #[+icon("code", 16)]
include ../_includes/_footer


@ -0,0 +1,871 @@
{
"resources": [
{
"id": "spacymoji",
"slogan": "Emoji handling and meta data as a spaCy pipeline component",
"github": "ines/spacymoji",
"description": "spaCy v2.0 extension and pipeline component for adding emoji meta data to `Doc` objects. Detects emoji consisting of one or more unicode characters, and can optionally merge multi-char emoji (combined pictures, emoji with skin tone modifiers) into one token. Human-readable emoji descriptions are added as a custom attribute, and an optional lookup table can be provided for your own descriptions. The extension sets the custom `Doc`, `Token` and `Span` attributes `._.is_emoji`, `._.emoji_desc`, `._.has_emoji` and `._.emoji`.",
"pip": "spacymoji",
"category": ["pipeline"],
"tags": ["emoji", "unicode"],
"thumb": "https://i.imgur.com/XOTYIgn.jpg",
"code_example": [
"import spacy",
"from spacymoji import Emoji",
"",
"nlp = spacy.load('en')",
"emoji = Emoji(nlp)",
"nlp.add_pipe(emoji, first=True)",
"",
"doc = nlp(u'This is a test 😻 👍🏿')",
"assert doc._.has_emoji == True",
"assert doc[2:5]._.has_emoji == True",
"assert doc[0]._.is_emoji == False",
"assert doc[4]._.is_emoji == True",
"assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'",
"assert len(doc._.emoji) == 2",
"assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')"
],
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
}
},
{
"id": "spacy_hunspell",
"slogan": "Add spellchecking and spelling suggestions to your spaCy pipeline using Hunspell",
"description": "This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add [Hunspell](http://hunspell.github.io) support for spellchecking.",
"github": "tokestermw/spacy_hunspell",
"pip": "spacy_hunspell",
"code_example": [
"import spacy",
"from spacy_hunspell import spaCyHunSpell",
"",
"nlp = spacy.load('en_core_web_sm')",
"hunspell = spaCyHunSpell(nlp, 'mac')",
"nlp.add_pipe(hunspell)",
"doc = nlp('I can haz cheezeburger.')",
"haz = doc[2]",
"haz._.hunspell_spell # False",
"haz._.hunspell_suggest # ['ha', 'haze', 'hazy', 'has', 'hat', 'had', 'hag', 'ham', 'hap', 'hay', 'haw', 'ha z']"
],
"author": "Motoki Wu",
"author_links": {
"github": "tokestermw",
"twitter": "plusepsilon"
},
"category": ["pipeline"],
"tags": ["spellcheck"]
},
{
"id": "spacy_grammar",
"slogan": "Language Tool style grammar handling with spaCy",
"description": "This package leverages the [Matcher API](https://spacy.io/docs/usage/rule-based-matching) in spaCy to quickly match on spaCy tokens not dissimilar to regex. It reads a `grammar.yml` file to load up custom patterns and returns the results inside `Doc`, `Span`, and `Token`. It is extensible through adding rules to `grammar.yml` (though currently only the simple string matching is implemented).",
"github": "tokestermw/spacy_grammar",
"code_example": [
"import spacy",
"from spacy_grammar.grammar import Grammar",
"",
"nlp = spacy.load('en')",
"grammar = Grammar(nlp)",
"nlp.add_pipe(grammar)",
"doc = nlp('I can haz cheeseburger.')",
"doc._.has_grammar_error # True"
],
"author": "Motoki Wu",
"author_links": {
"github": "tokestermw",
"twitter": "plusepsilon"
},
"category": ["pipeline"]
},
{
"id": "spacy_kenlm",
"slogan": "KenLM extension for spaCy 2.0",
"github": "tokestermw/spacy_kenlm",
"pip": "spacy_kenlm",
"code_example": [
"import spacy",
"from spacy_kenlm import spaCyKenLM",
"",
"nlp = spacy.load('en_core_web_sm')",
"spacy_kenlm = spaCyKenLM() # default model from test.arpa",
"nlp.add_pipe(spacy_kenlm)",
"doc = nlp('How are you?')",
"doc._.kenlm_score # doc score",
"doc[:2]._.kenlm_score # span score",
"doc[2]._.kenlm_score # token score"
],
"author": "Motoki Wu",
"author_links": {
"github": "tokestermw",
"twitter": "plusepsilon"
},
"category": ["pipeline"]
},
{
"id": "spacy_readability",
"slogan": "Add text readability meta data to Doc objects",
"description": "spaCy v2.0 pipeline component for calculating readability scores of text. Provides scores for Flesch-Kincaid grade level, Flesch-Kincaid reading ease, and Dale-Chall.",
"github": "mholtzscher/spacy_readability",
"pip": "spacy-readability",
"code_example": [
"import spacy",
"from spacy_readability import Readability",
"",
"nlp = spacy.load('en')",
"read = Readability(nlp)",
"nlp.add_pipe(read, last=True)",
"doc = nlp(\"I am some really difficult text to read because I use obnoxiously large words.\")",
"doc._.flesch_kincaid_grade_level",
"doc._.flesch_kincaid_reading_ease",
"doc._.dale_chall"
],
"author": "Michael Holtzscher",
"author_links": {
"github": "mholtzscher"
},
"category": ["pipeline"]
},
{
"id": "spacy-sentence-segmenter",
"title": "Sentence Segmenter",
"slogan": "Custom sentence segmentation for spaCy",
"code_example": [
"from seg.newline.segmenter import NewLineSegmenter",
"import spacy",
"",
"nlseg = NewLineSegmenter()",
"nlp = spacy.load('en')",
"nlp.add_pipe(nlseg.set_sent_starts, name='sentence_segmenter', before='parser')",
"doc = nlp(my_doc_text)"
],
"author": "tc64",
"author_links": {
"github": "tc64"
},
"category": ["pipeline"]
},
{
"id": "spacy_cld",
"title": "spaCy-CLD",
"slogan": "Add language detection to your spaCy pipeline using CLD2",
"description": "spaCy-CLD operates on `Doc` and `Span` spaCy objects. When called on a `Doc` or `Span`, the object is given two attributes: `languages` (a list of up to 3 language codes) and `language_scores` (a dictionary mapping language codes to confidence scores between 0 and 1).\n\nspacy-cld is a little extension that wraps the [PYCLD2](https://github.com/aboSamoor/pycld2) Python library, which in turn wraps the [Compact Language Detector 2](https://github.com/CLD2Owners/cld2) C library originally built at Google for the Chromium project. CLD2 uses character n-grams as features and a Naive Bayes classifier to identify 80+ languages from Unicode text strings (or XML/HTML). It can detect up to 3 different languages in a given document, and reports a confidence score for each language.",
"github": "nickdavidhaynes/spacy-cld",
"pip": "spacy_cld",
"code_example": [
"import spacy",
"from spacy_cld import LanguageDetector",
"",
"nlp = spacy.load('en')",
"language_detector = LanguageDetector()",
"nlp.add_pipe(language_detector)",
"doc = nlp('This is some English text.')",
"",
"doc._.languages # ['en']",
"doc._.language_scores['en'] # 0.96"
],
"author": "Nicholas D Haynes",
"author_links": {
"github": "nickdavidhaynes"
},
"category": ["pipeline"]
},
{
"id": "spacy-lookup",
"slogan": "A powerful entity matcher for very large dictionaries, using the FlashText module",
"description": "spaCy v2.0 extension and pipeline component for adding Named Entities metadata to `Doc` objects. Detects Named Entities using dictionaries. The extension sets the custom `Doc`, `Token` and `Span` attributes `._.is_entity`, `._.entity_type`, `._.has_entities` and `._.entities`. Named Entities are matched using the python module `flashtext`, and looked up in the data provided by different dictionaries.",
"github": "mpuig/spacy-lookup",
"pip": "spacy-lookup",
"code_example": [
"import spacy",
"from spacy_lookup import Entity",
"",
"nlp = spacy.load('en')",
"entity = Entity(nlp, keywords_list=['python', 'java platform'])",
"nlp.add_pipe(entity, last=True)",
"",
"doc = nlp(u\"I am a product manager for a java and python.\")",
"assert doc._.has_entities == True",
"assert doc[2:5]._.has_entities == True",
"assert doc[0]._.is_entity == False",
"assert doc[3]._.is_entity == True",
"print(doc._.entities)"
],
"author": "Marc Puig",
"author_links": {
"github": "mpuig"
},
"category": ["pipeline"]
},
{
"id": "spacy-iwnlp",
"slogan": "German lemmatization with IWNLP",
"description": "This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add [IWNLP-py](https://github.com/Liebeck/iwnlp-py) as German lemmatizer directly into your spaCy pipeline.",
"github": "Liebeck/spacy-iwnlp",
"pip": "spacy-iwnlp",
"code_example": [
"import spacy",
"from spacy_iwnlp import spaCyIWNLP",
"",
"nlp = spacy.load('de')",
"iwnlp = spaCyIWNLP(lemmatizer_path='data/IWNLP.Lemmatizer_20170501.json')",
"nlp.add_pipe(iwnlp)",
"doc = nlp('Wir mögen Fußballspiele mit ausgedehnten Verlängerungen.')",
"for token in doc:",
" print('POS: {}\tIWNLP:{}'.format(token.pos_, token._.iwnlp_lemmas))"
],
"author": "Matthias Liebeck",
"author_links": {
"github": "Liebeck"
},
"category": ["pipeline"],
"tags": ["lemmatizer", "german"]
},
{
"id": "spacy-sentiws",
"slogan": "German sentiment scores with SentiWS",
"description": "This package uses the [spaCy 2.0 extensions](https://spacy.io/usage/processing-pipelines#extensions) to add [SentiWS](http://wortschatz.uni-leipzig.de/en/download) as German sentiment score directly into your spaCy pipeline.",
"github": "Liebeck/spacy-sentiws",
"pip": "spacy-sentiws",
"code_example": [
"import spacy",
"from spacy_sentiws import spaCySentiWS",
"",
"nlp = spacy.load('de')",
"sentiws = spaCySentiWS(sentiws_path='data/sentiws/')",
"nlp.add_pipe(sentiws)",
"doc = nlp('Die Dummheit der Unterwerfung blüht in hübschen Farben.')",
"",
"for token in doc:",
" print('{}, {}, {}'.format(token.text, token._.sentiws, token.pos_))"
],
"author": "Matthias Liebeck",
"author_links": {
"github": "Liebeck"
},
"category": ["pipeline"],
"tags": ["sentiment", "german"]
},
{
"id": "spacy-lefff",
"slogan": "French lemmatization with Lefff",
"description": "spacy v2.0 extension and pipeline component for adding a French lemmatizer based on [Lefff](https://hal.inria.fr/inria-00521242/).",
"github": "sammous/spacy-lefff",
"pip": "spacy-lefff",
"code_example": [
"import spacy",
"from spacy_lefff import LefffLemmatizer",
"",
"nlp = spacy.load('fr')",
"french_lemmatizer = LefffLemmatizer()",
"nlp.add_pipe(french_lemmatizer, name='lefff', after='parser')",
"doc = nlp(u\"Paris est une ville très chère.\")",
"for d in doc:",
" print(d.text, d.pos_, d._.lefff_lemma, d.tag_)"
],
"author": "Sami Moustachir",
"author_links": {
"github": "sammous"
},
"category": ["pipeline"],
"tags": ["lemmatizer", "french"]
},
{
"id": "lemmy",
"title": "Lemmy",
"slogan": "A Danish lemmatizer",
"description": "Lemmy is a lemmatizer for Danish 🇩🇰 . It comes already trained on Dansk Sprognævns (DSN) word list (fuldformliste) and the Danish Universal Dependencies and is ready for use. Lemmy also supports training on your own dataset. The model currently included in Lemmy was evaluated on the Danish Universal Dependencies dev dataset and scored an accuracy > 99%.\n\nYou can use Lemmy as a spaCy extension, more specifically a spaCy pipeline component. This is highly recommended and makes the lemmas easily accessible from the spaCy tokens. Lemmy makes use of POS tags to predict the lemmas. When wired up to the spaCy pipeline, Lemmy has the benefit of using spaCy's built-in POS tagger.",
"github": "sorenlind/lemmy",
"pip": "lemmy",
"code_example": [
"import da_custom_model as da # name of your spaCy model",
"import lemmy.pipe",
"nlp = da.load()",
"",
"# create an instance of Lemmy's pipeline component for spaCy",
"pipe = lemmy.pipe.load()",
"",
"# add the component to the spaCy pipeline.",
"nlp.add_pipe(pipe, after='tagger')",
"",
"# lemmas can now be accessed using the `._.lemma` attribute on the tokens",
"nlp(\"akvariernes\")[0]._.lemma"
],
"thumb": "https://i.imgur.com/RJVFRWm.jpg",
"author": "Søren Lind Kristiansen",
"author_links": {
"github": "sorenlind"
},
"category": ["pipeline"],
"tags": ["lemmatizer", "danish"]
},
{
"id": "wmd-relax",
"slogan": "Calculates word mover's distance insanely fast",
"description": "Calculates Word Mover's Distance as described in [From Word Embeddings To Document Distances](http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf) by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.\n\n⚠ **This package is currently only compatible with spaCy v1.x.**",
"github": "src-d/wmd-relax",
"thumb": "https://i.imgur.com/f91C3Lf.jpg",
"code_example": [
"import spacy",
"import wmd",
"",
"nlp = spacy.load('en', create_pipeline=wmd.WMD.create_spacy_pipeline)",
"doc1 = nlp(\"Politician speaks to the media in Illinois.\")",
"doc2 = nlp(\"The president greets the press in Chicago.\")",
"print(doc1.similarity(doc2))"
],
"author": "source{d}",
"author_links": {
"github": "src-d",
"twitter": "sourcedtech",
"website": "https://sourced.tech"
},
"category": ["pipeline"]
},
{
"id": "neuralcoref",
"slogan": "State-of-the-art coreference resolution based on neural nets and spaCy",
"description": "This coreference resolution module is based on the super fast [spaCy](https://spacy.io/) parser and uses the neural net scoring model described in [Deep Reinforcement Learning for Mention-Ranking Coreference Models](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf) by Kevin Clark and Christopher D. Manning, EMNLP 2016. With ✨Neuralcoref v2.0, you should now be able to train the coreference resolution system on your own dataset, e.g. another language than English! — **provided you have an annotated dataset**.",
"github": "huggingface/neuralcoref",
"thumb": "https://i.imgur.com/j6FO9O6.jpg",
"code_example": [
"from neuralcoref import Coref",
"",
"coref = Coref()",
"clusters = coref.one_shot_coref(utterances=u\"She loves him.\", context=u\"My sister has a dog.\")",
"mentions = coref.get_mentions()",
"utterances = coref.get_utterances()",
"resolved_utterance_text = coref.get_resolved_utterances()"
],
"author": "Hugging Face",
"author_links": {
"github": "huggingface"
},
"category": ["standalone", "conversational"],
"tags": ["coref"]
},
{
"id": "neuralcoref-vizualizer",
"title": "Neuralcoref Visualizer",
"slogan": "State-of-the-art coreference resolution based on neural nets and spaCy",
"description": "In short, coreference is the fact that two or more expressions in a text like pronouns or nouns link to the same person or thing. It is a classical Natural language processing task, that has seen a revival of interest in the past two years as several research groups applied cutting-edge deep-learning and reinforcement-learning techniques to it. It is also one of the key building blocks to building conversational Artificial intelligences.",
"url": "https://huggingface.co/coref/",
"image": "https://i.imgur.com/3yy4Qyf.png",
"thumb": "https://i.imgur.com/j6FO9O6.jpg",
"github": "huggingface/neuralcoref",
"category": ["visualizers", "conversational"],
"tags": ["coref", "chatbots"],
"author": "Hugging Face",
"author_links": {
"github": "huggingface"
}
},
{
"id": "spacy-vis",
"slogan": "A visualisation tool for spaCy using Hierplane",
"description": "A visualiser for spaCy annotations. This visualisation uses the [Hierplane](https://allenai.github.io/hierplane/) Library to render the dependency parse from spaCy's models. It also includes visualisation of entities and POS tags within nodes.",
"github": "DeNeutoy/spacy-vis",
"url": "http://spacyvis.allennlp.org/spacy-parser",
"thumb": "https://i.imgur.com/DAG9QFd.jpg",
"image": "https://raw.githubusercontent.com/DeNeutoy/spacy-vis/master/img/example.gif",
"author": "Mark Neumann",
"author_links": {
"twitter": "MarkNeumannnn",
"github": "DeNeutoy"
},
"category": ["visualizers"]
},
{
"id": "matcher-explorer",
"title": "Rule-based Matcher Explorer",
"slogan": "Test spaCy's rule-based Matcher by creating token patterns interactively",
"description": "Test spaCy's rule-based `Matcher` by creating token patterns interactively and running them over your text. Each token can set multiple attributes like text value, part-of-speech tag or boolean flags. The token-based view lets you explore how spaCy processes your text and why your pattern matches, or why it doesn't. For more details on rule-based matching, see the [documentation](https://spacy.io/usage/linguistic-features#rule-based-matching).",
"image": "https://explosion.ai/assets/img/demos/matcher.png",
"thumb": "https://i.imgur.com/rPK4AGt.jpg",
"url": "https://explosion.ai/demos/matcher",
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
},
"category": ["visualizers"]
},
{
"id": "displacy",
"title": "displaCy",
"slogan": "A modern syntactic dependency visualizer",
"description": "Visualize spaCy's guess at the syntactic structure of a sentence. Arrows point from children to heads, and are labelled by their relation type.",
"url": "https://explosion.ai/demos/displacy",
"thumb": "https://i.imgur.com/nxDcHaL.jpg",
"image": "https://explosion.ai/assets/img/demos/displacy.png",
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
},
"category": ["visualizers"]
},
{
"id": "displacy-ent",
"title": "displaCy ENT",
"slogan": "A modern named entity visualizer",
"description": "Visualize spaCy's guess at the named entities in the document. You can filter the displayed types, to only show the annotations you're interested in.",
"url": "https://explosion.ai/demos/displacy-ent",
"thumb": "https://i.imgur.com/A77Ecbs.jpg",
"image": "https://explosion.ai/assets/img/demos/displacy-ent.png",
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
},
"category": ["visualizers"]
},
{
"id": "explacy",
"slogan": "A small tool that explains spaCy parse results",
"github": "tylerneylon/explacy",
"thumb": "https://i.imgur.com/V1hCWmn.jpg",
"image": "https://raw.githubusercontent.com/tylerneylon/explacy/master/img/screenshot.png",
"code_example": [
"import spacy",
"import explacy",
"",
"nlp = spacy.load('en')",
"explacy.print_parse_info(nlp, 'The salad was surprisingly tasty.')"
],
"author": "Tyler Neylon",
"author_links": {
"github": "tylerneylon"
},
"category": ["visualizers"]
},
{
"id": "rasa",
"title": "Rasa NLU",
"slogan": "Turn natural language into structured data",
"description": "Rasa NLU (Natural Language Understanding) is a tool for understanding what is being said in short pieces of text. Rasa NLU is primarily used to build chatbots and voice apps, where this is called intent classification and entity extraction. To use Rasa, *you have to provide some training data*.",
"github": "RasaHQ/rasa_nlu",
"pip": "rasa_nlu",
"thumb": "https://i.imgur.com/ndCfKNq.png",
"url": "https://nlu.rasa.com/",
"author": "Rasa",
"author_links": {
"github": "RasaHQ"
},
"category": ["conversational"],
"tags": ["chatbots"]
},
{
"id": "torchtext",
"title": "torchtext",
"slogan": "Data loaders and abstractions for text and NLP",
"github": "pytorch/text",
"pip": "torchtext",
"thumb": "https://i.imgur.com/WFkxuPo.png",
"code_example": [
">>> pos = data.TabularDataset(",
"... path='data/pos/pos_wsj_train.tsv', format='tsv',",
"... fields=[('text', data.Field()),",
"... ('labels', data.Field())])",
"...",
">>> sentiment = data.TabularDataset(",
"... path='data/sentiment/train.json', format='json',",
"... fields={'sentence_tokenized': ('text', data.Field(sequential=True)),",
"... 'sentiment_gold': ('labels', data.Field(sequential=False))})"
],
"category": ["standalone", "research"],
"tags": ["pytorch"]
},
{
"id": "allennlp",
"title": "AllenNLP",
"slogan": "An open-source NLP research library, built on PyTorch and spaCy",
"description": "AllenNLP is a new library designed to accelerate NLP research, by providing a framework that supports modern deep learning workflows for cutting-edge language understanding problems. AllenNLP uses spaCy as a preprocessing component. You can also use AllenNLP to develop spaCy pipeline components, to add annotations to the `Doc` object.",
"github": "allenai/allennlp",
"pip": "allennlp",
"thumb": "https://i.imgur.com/U8opuDN.jpg",
"url": "http://allennlp.org",
"author": " Allen Institute for Artificial Intelligence",
"author_links": {
"github": "allenai",
"twitter": "allenai_org",
"website": "http://allenai.org"
},
"category": ["standalone", "research"]
},
{
"id": "textacy",
"slogan": "NLP, before and after spaCy",
"description": "`textacy` is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance `spacy` library. With the fundamentals tokenization, part-of-speech tagging, dependency parsing, etc. delegated to another library, `textacy` focuses on the tasks that come before and follow after.",
"github": "chartbeat-labs/textacy",
"pip": "textacy",
"url": "https://chartbeat-labs.github.io/textacy/",
"author": "Burton DeWilde",
"author_links": {
"github": "bdewilde",
"twitter": "bjdewilde"
},
"category": ["standalone"]
},
{
"id": "mordecai",
"slogan": "Full text geoparsing using spaCy, Geonames and Keras",
"description": "Extract the place names from a piece of text, resolve them to the correct place, and return their coordinates and structured geographic information.",
"github": "openeventdata/mordecai",
"pip": "mordecai",
"thumb": "https://i.imgur.com/gPJ9upa.jpg",
"code_example": [
"from mordecai import Geoparser",
"geo = Geoparser()",
"geo.geoparse(\"I traveled from Oxford to Ottawa.\")"
],
"author": "Andy Halterman",
"author_links": {
"github": "ahalterman",
"twitter": "ahalterman"
},
"category": ["standalone"]
},
{
"id": "kindred",
"title": "Kindred",
"slogan": "Biomedical relation extraction using spaCy",
"description": "Kindred is a package for relation extraction in biomedical texts. Given some training data, it can build a model to identify relations between entities (e.g. drugs, genes, etc) in a sentence.",
"github": "jakelever/kindred",
"pip": "kindred",
"code_example": [
"import kindred",
"",
"trainCorpus = kindred.bionlpst.load('2016-BB3-event-train')",
"devCorpus = kindred.bionlpst.load('2016-BB3-event-dev')",
"predictionCorpus = devCorpus.clone()",
"predictionCorpus.removeRelations()",
"classifier = kindred.RelationClassifier()",
"classifier.train(trainCorpus)",
"classifier.predict(predictionCorpus)",
"f1score = kindred.evaluate(devCorpus, predictionCorpus, metric='f1score')"
],
"author": "Jake Lever",
"author_links": {
"github": "jakelever"
},
"category": ["standalone"]
},
{
"id": "sense2vec",
"slogan": "Use NLP to go beyond vanilla word2vec",
"description": "sense2vec ([Trask et. al](https://arxiv.org/abs/1511.06388), 2015) is a nice twist on [word2vec](https://en.wikipedia.org/wiki/Word2vec) that lets you learn more interesting, detailed and context-sensitive word vectors. For an interactive example of the technology, see our [sense2vec demo](https://explosion.ai/demos/sense2vec) that lets you explore semantic similarities across all Reddit comments of 2015.",
"github": "explosion/sense2vec",
"pip": "sense2vec==1.0.0a0",
"thumb": "https://i.imgur.com/awfdhX6.jpg",
"image": "https://explosion.ai/assets/img/demos/sense2vec.png",
"url": "https://explosion.ai/demos/sense2vec",
"code_example": [
"import spacy",
"from sense2vec import Sense2VecComponent",
"",
"nlp = spacy.load('en')",
"s2v = Sense2VecComponent('/path/to/reddit_vectors-1.1.0')",
"nlp.add_pipe(s2v)",
"",
"doc = nlp(u\"A sentence about natural language processing.\")",
"assert doc[3].text == u'natural language processing'",
"freq = doc[3]._.s2v_freq",
"vector = doc[3]._.s2v_vec",
"most_similar = doc[3]._.s2v_most_similar(3)",
"# [(('natural language processing', 'NOUN'), 1.0),",
"# (('machine learning', 'NOUN'), 0.8986966609954834),",
"# (('computer vision', 'NOUN'), 0.8636297583580017)]"
],
"category": ["pipeline", "standalone", "visualizers"],
"tags": ["vectors"],
"author": "Explosion AI",
"author_links": {
"twitter": "explosion_ai",
"github": "explosion",
"website": "https://explosion.ai"
}
},
{
"id": "spacyr",
"slogan": "An R wrapper for spaCy",
"github": "quanteda/spacyr",
"cran": "spacyr",
"code_example": [
"library(\"spacyr\")",
"spacy_initialize()",
"",
"txt <- c(d1 = \"spaCy excels at large-scale information extraction tasks.\",",
" d2 = \"Mr. Smith goes to North Carolina.\")",
"",
"# process documents and obtain a data.table",
"parsedtxt <- spacy_parse(txt)"
],
"code_language": "r",
"author": "Kenneth Benoit & Aki Matsuo",
"category": ["nonpython"]
},
{
"id": "cleannlp",
"title": "CleanNLP",
"slogan": "A tidy data model for NLP in R",
"description": "The cleanNLP package is designed to make it as painless as possible to turn raw text into feature-rich data frames. the package offers four backends that can be used for parsing text: `tokenizers`, `udpipe`, `spacy` and `corenlp`.",
"github": "statsmaths/cleanNLP",
"cran": "cleanNLP",
"author": "Taylor B. Arnold",
"author_links": {
"github": "statsmaths"
},
"category": ["nonpython"]
},
{
"id": "spacy-cpp",
"slogan": "C++ wrapper library for spaCy",
"description": "The goal of spacy-cpp is to expose the functionality of spaCy to C++ applications, and to provide an API that is similar to that of spaCy, enabling rapid development in Python and simple porting to C++.",
"github": "d99kris/spacy-cpp",
"code_example": [
"Spacy::Spacy spacy;",
"auto nlp = spacy.load(\"en_core_web_sm\");",
"auto doc = nlp.parse(\"This is a sentence.\");",
"for (auto& token : doc.tokens())",
" std::cout << token.text() << \" [\" << token.pos_() << \"]\\n\";"
],
"code_language": "cpp",
"author": "Kristofer Berggren",
"author_links": {
"github": "d99kris"
},
"category": ["nonpython"]
},
{
"id": "spaCy.jl",
"slogan": "Julia interface for spaCy (work in progress)",
"github": "jekbradbury/SpaCy.jl",
"author": "James Bradbury",
"author_links": {
"github": "jekbradbury",
"twitter": "jekbradbury"
},
"category": ["nonpython"]
},
{
"id": "spacy_api",
"slogan": "Server/client to load models in a separate, dedicated process",
"github": "kootenpv/spacy_api",
"pip": "spacy_api",
"code_example": [
"from spacy_api import Client",
"",
"spacy_client = Client() # default args host/port",
"doc = spacy_client.single(\"How are you\")"
],
"author": "Pascal van Kooten",
"author_links": {
"github": "kootenpv"
},
"category": ["apis"]
},
{
"id": "spacy-api-docker",
"slogan": "spaCy REST API, wrapped in a Docker container",
"github": "jgontrum/spacy-api-docker",
"url": "https://hub.docker.com/r/jgontrum/spacyapi/",
"thumb": "https://i.imgur.com/NRnDKyj.jpg",
"code_example": [
"version: '2'",
"",
"services:",
" spacyapi:",
" image: jgontrum/spacyapi:en_v2",
" ports:",
" - \"127.0.0.1:8080:80\"",
" restart: always"
],
"code_language": "docker",
"author": "Johannes Gontrum",
"author_links": {
"github": "jgontrum"
},
"category": ["apis"]
},
{
"id": "languagecrunch",
"slogan": "NLP server for spaCy, WordNet and NeuralCoref as a Docker image",
"github": "artpar/languagecrunch",
"code_example": [
"docker run -it -p 8080:8080 artpar/languagecrunch",
"curl http://localhost:8080/nlp/parse?`echo -n \"The new twitter is so weird. Seriously. Why is there a new twitter? What was wrong with the old one? Fix it now.\" | python -c \"import urllib, sys; print(urllib.urlencode({'sentence': sys.stdin.read()}))\"`"
],
"code_language": "bash",
"author": "Parth Mudgal",
"author_links": {
"github": "artpar"
},
"category": ["apis"]
},
{
"id": "spacy-nlp",
"slogan": " Expose spaCy NLP text parsing to Node.js (and other languages) via Socket.IO",
"github": "kengz/spacy-nlp",
"thumb": "https://i.imgur.com/w41VSr7.jpg",
"code_example": [
"const spacyNLP = require(\"spacy-nlp\")",
"// default port 6466",
"// start the server with the python client that exposes spacyIO (or use an existing socketIO server at IOPORT)",
"var serverPromise = spacyNLP.server({ port: process.env.IOPORT });",
"// Loading spacy may take up to 15s"
],
"code_language": "javascript",
"author": "Wah Loon Keng",
"author_links": {
"github": "kengz"
},
"category": ["apis", "nonpython"]
},
{
"id": "prodigy",
"title": "Prodigy",
"slogan": "Radically efficient machine teaching, powered by active learning",
"description": "Prodigy is an annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. Whether you're working on entity recognition, intent detection or image classification, Prodigy can help you train and evaluate your models faster. Stream in your own examples or real-world data from live APIs, update your model in real-time and chain models together to build more complex systems.",
"thumb": "https://i.imgur.com/UVRtP6g.jpg",
"image": "https://i.imgur.com/Dt5vrY6.png",
"url": "https://prodi.gy",
"code_example": [
"prodigy dataset ner_product \"Improve PRODUCT on Reddit data\"",
"✨ Created dataset 'ner_product'.",
"",
"prodigy ner.teach ner_product en_core_web_sm ~/data.jsonl --label PRODUCT",
"✨ Starting the web server on port 8080..."
],
"code_language": "bash",
"category": ["standalone"],
"author": "Explosion AI",
"author_links": {
"twitter": "explosion_ai",
"github": "explosion",
"website": "https://explosion.ai"
}
},
{
"id": "dragonfire",
"title": "Dragonfire",
"slogan": "An open-source virtual assistant for Ubuntu based Linux distributions",
"github": "DragonComputer/Dragonfire",
"thumb": "https://i.imgur.com/5fqguKS.jpg",
"image": "https://raw.githubusercontent.com/DragonComputer/Dragonfire/master/docs/img/demo.gif",
"author": "Dragon Computer",
"author_links": {
"github": "DragonComputer",
"website": "http://dragon.computer"
},
"category": ["standalone"]
},
{
"type": "education",
"id": "oreilly-python-ds",
"title": "Introduction to Machine Learning with Python: A Guide for Data Scientists",
"slogan": "O'Reilly, 2016",
"description": "Machine learning has become an integral part of many commercial applications and research projects, but this field is not exclusive to large companies with extensive research teams. If you use Python, even as a beginner, this book will teach you practical ways to build your own machine learning solutions. With all the data available today, machine learning applications are limited only by your imagination.",
"cover": "https://covers.oreillystatic.com/images/0636920030515/lrg.jpg",
"url": "http://shop.oreilly.com/product/0636920030515.do",
"author": "Andreas Müller, Sarah Guido",
"category": ["books"]
},
{
"type": "education",
"id": "text-analytics-python",
"title": "Text Analytics with Python",
"slogan": "Apress / Springer, 2016",
"description": "*Text Analytics with Python* teaches you the techniques related to natural language processing and text analytics, and you will gain the skills to know which technique is best suited to solve a particular problem. You will look at each technique and algorithm with both a bird's eye view to understand how it can be used as well as with a microscopic view to understand the mathematical concepts and to implement them to solve your own problems.",
"github": "dipanjanS/text-analytics-with-python",
"cover": "https://i.imgur.com/AOmzZu8.png",
"url": "https://www.amazon.com/Text-Analytics-Python-Real-World-Actionable/dp/148422387X",
"author": "Dipanjan Sarkar",
"category": ["books"]
},
{
"type": "education",
"id": "practical-ml-python",
"title": "Practical Machine Learning with Python",
"slogan": "Apress, 2017",
"description": "Master the essential skills needed to recognize and solve complex problems with machine learning and deep learning. Using real-world examples that leverage the popular Python machine learning ecosystem, this book is your perfect companion for learning the art and science of machine learning to become a successful practitioner. The concepts, techniques, tools, frameworks, and methodologies used in this book will teach you how to think, design, build, and execute machine learning systems and projects successfully.",
"github": "dipanjanS/practical-machine-learning-with-python",
"cover": "https://i.imgur.com/5F4mkt7.jpg",
"url": "https://www.amazon.com/Practical-Machine-Learning-Python-Problem-Solvers/dp/1484232062",
"author": "Dipanjan Sarkar, Raghav Bali, Tushar Sharma",
"category": ["books"]
},
{
"type": "education",
"id": "datacamp-nlp-fundamentals",
"title": "Natural Language Processing Fundamentals in Python",
"slogan": "Datacamp, 2017",
"description": "In this course, you'll learn Natural Language Processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries which utilize deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.",
"url": "https://www.datacamp.com/courses/natural-language-processing-fundamentals-in-python",
"thumb": "https://i.imgur.com/0Zks7c0.jpg",
"author": "Katharine Jarmul",
"author_links": {
"twitter": "kjam"
},
"category": ["courses"]
},
{
"type": "education",
"id": "learning-path-spacy",
"title": "Learning Path: Mastering spaCy for Natural Language Processing",
"slogan": "O'Reilly, 2017",
"description": "spaCy, a fast, user-friendly library for teaching computers to understand text, simplifies NLP techniques, such as speech tagging and syntactic dependencies, so you can easily extract information, attributes, and objects from massive amounts of text to then document, measure, and analyze. This Learning Path is a hands-on introduction to using spaCy to discover insights through natural language processing. While end-to-end natural language processing solutions can be complex, youll learn the linguistics, algorithms, and machine learning skills to get the job done.",
"url": "https://www.safaribooksonline.com/library/view/learning-path-mastering/9781491986653/",
"thumb": "https://i.imgur.com/9MIgMAc.jpg",
"author": "Aaron Kramer",
"category": ["courses"]
}
],
"projectCats": {
"pipeline": {
"title": "Pipeline",
"description": "Custom pipeline components and extensions"
},
"conversational": {
"title": "Conversational",
"description": "Frameworks and utilities for working with conversational text, e.g. for chat bots"
},
"research": {
"title": "Research",
"description": "Frameworks and utilities for developing better NLP models, especially using neural networks"
},
"visualizers": {
"title": "Visualizers",
"description": "Demos and tools to visualize NLP annotations or systems"
},
"apis": {
"title": "Containers & APIs",
"description": "Infrastructure tools for managing or deploying spaCy"
},
"nonpython": {
"title": "Non-Python",
"description": "Wrappers, bindings and implementations in other programming languages"
},
"standalone": {
"title": "Standalone",
"description": "Self-contained libraries or tools that use spaCy under the hood"
}
},
"educationCats": {
"books": {
"title": "Books",
"description": "Books about or featuring spaCy"
},
"courses": {
"title": "Courses",
"description": "Online courses and interactive tutorials"
}
}
}

View File

@ -16,8 +16,7 @@
"Visualizers": "visualizers"
},
"In-depth": {
"Code Examples": "examples",
"Resources": "resources"
"Code Examples": "examples"
}
},
@ -25,7 +24,6 @@
"title": "Install spaCy",
"next": "models",
"quickstart": true,
"changelog": true,
"menu": {
"Quickstart": "quickstart",
"Instructions": "instructions",
@ -104,7 +102,7 @@
"menu": {
"How Pipelines Work": "pipelines",
"Custom Components": "custom-components",
"Developing Extensions": "extensions",
"Extension Attributes": "custom-components-extensions",
"Multi-Threading": "multithreading",
"Serialization": "serialization"
}
@ -147,7 +145,6 @@
"title": "Visualizers",
"tag_new": 2,
"teaser": "Visualize dependencies and entities in your browser and notebook, or export HTML.",
"next": "resources",
"menu": {
"Dependencies": "dep",
"Entities": "ent",
@ -156,25 +153,9 @@
}
},
"resources": {
"title": "Resources",
"teaser": "Libraries, demos, books, courses and research systems featuring spaCy.",
"next": "examples",
"menu": {
"Third-party libraries": "libraries",
"Extensions": "extensions",
"Demos & Visualizations": "demos",
"Books & Courses": "books",
"Jupyter Notebooks": "notebooks",
"Videos": "videos",
"Research": "research"
}
},
"examples": {
"title": "Code Examples",
"teaser": "Full code examples you can modify and run.",
"next": "resources",
"menu": {
"Information Extraction": "information-extraction",
"Pipeline": "pipeline",

View File

@ -4,9 +4,7 @@ p
| In this section, we provide benchmark accuracies for the pre-trained
| model pipelines we distribute with spaCy. Evaluations are conducted
| end-to-end from raw text, with no "gold standard" pre-processing, over
| text from a mix of genres where possible. For a more detailed
| comparison of the available models, see the new
| #[+a("/models/comparison") model comparison tool].
| text from a mix of genres where possible.
+aside("Methodology")
| The evaluation was conducted on raw text with no gold standard

View File

@ -3,20 +3,21 @@
+h(2, "changelog") Changelog
+button(gh("spacy") + "/releases", false, "secondary", "small").u-float-right.u-nowrap View releases
div(data-tpl="changelog" data-tpl-key="error" style="display: none")
+infobox
section(data-vue="changelog")
+infobox(v-if="error")
| Unable to load changelog from GitHub. Please see the
| #[+a(gh("spacy") + "/releases") releases page] instead.
section(data-tpl="changelog" data-tpl-key="table" style="display: none")
+table(["Date", "Version", "Title"])
tbody(data-tpl="changelog" data-tpl-key="releases")
+row(data-tpl="changelog" data-tpl-key="item")
+cell.u-nowrap
+label(data-changelog="date")
+cell(data-changelog="tag")
+cell.u-text-small(data-changelog="title")
section(v-if="!error" v-cloak="")
+table(["Date", "Version", "Title"])(v-if="releases.length")
+row(v-for="version in releases")
+cell.u-nowrap #[+label(v-text="version.date")]
+cell
+a(gh("spacy") + "/releases")(v-bind:href="version.url")
code(v-text="version.tag")
+cell.u-text-small(v-text="version.title")
section(v-if="!error" v-cloak="")
+h(3) Pre-releases
+aside("About pre-releases")
@ -27,5 +28,10 @@ section(data-tpl="changelog" data-tpl-key="table" style="display: none")
| on pip.
+badge("https://img.shields.io/pypi/v/spacy-nightly.svg?style=flat-square", "https://pypi.python.org/pypi/spacy-nightly")
+table(["Date", "Version", "Title"])
tbody(data-tpl="changelog" data-tpl-key="prereleases")
+table(["Date", "Version", "Title"])(v-if="prereleases.length")
+row(v-for="version in prereleases")
+cell.u-nowrap #[+label(v-text="version.date")]
+cell
+a(gh("spacy") + "/releases")(v-bind:href="version.url")
code(v-text="version.tag")
+cell.u-text-small(v-text="version.title")

View File

@ -18,9 +18,11 @@ p
| tech fund". To get the noun chunks in a document, simply iterate over
| #[+api("doc#noun_chunks") #[code Doc.noun_chunks]].
+code("Example").
nlp = spacy.load('en')
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
print(chunk.text, chunk.root.text, chunk.root.dep_,
chunk.root.head.text)
@ -48,8 +50,11 @@ p
| attributes, the value of #[code .dep] is a hash value. You can get
| the string value with #[code .dep_].
+code("Example").
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
print(token.text, token.dep_, token.head.text, token.head.pos_,
[child for child in token.children])
@ -79,14 +84,19 @@ p
| the tree by iterating over the words in the sentence. This is usually
| the best way to match an arc of interest — from below:
+code.
+code-exec.
import spacy
from spacy.symbols import nsubj, VERB
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
verbs.add(possible_subject.head)
print(verbs)
p
| If you try to match from above, you'll have to iterate twice: once for
@ -119,12 +129,23 @@ p
| #[+api("token#n_lefts") #[code Token.n_lefts]], that give the number of
| left and right children.
+code.
doc = nlp(u'bright red apples on the tree')
assert [token.text for token in doc[2].lefts]) == [u'bright', u'red']
assert [token.text for token in doc[2].rights]) == ['on']
assert doc[2].n_lefts == 2
assert doc[2].n_rights == 1
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"bright red apples on the tree")
print([token.text for token in doc[2].lefts]) # ['bright', 'red']
print([token.text for token in doc[2].rights]) # ['on']
print(doc[2].n_lefts) # 2
print(doc[2].n_rights) # 1
+code-exec.
import spacy
nlp = spacy.load('de_core_news_sm')
doc = nlp(u"schöne rote Äpfel auf dem Baum")
print([token.text for token in doc[2].lefts]) # ['schöne', 'rote']
print([token.text for token in doc[2].rights]) # ['auf']
p
| You can get a whole phrase by its syntactic head using the
@ -141,13 +162,18 @@ p
| to be contiguous. This is not true for the German model, which has many
| #[+a(COMPANY_URL + "/blog/german-model#word-order", true) non-projective dependencies].
+code.
doc = nlp(u'Credit and mortgage account holders must submit their requests')
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Credit and mortgage account holders must submit their requests")
root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
assert subject is descendant or subject.is_ancestor(descendant)
print(descendant.text, descendant.dep_, descendant.n_lefts, descendant.n_rights,
print(descendant.text, descendant.dep_, descendant.n_lefts,
descendant.n_rights,
[ancestor.text for ancestor in descendant.ancestors])
+table(["Text", "Dep", "n_lefts", "n_rights", "ancestors"])
@ -166,8 +192,11 @@ p
| #[strong within] the subtree — so if you use it as the end-point of a
| range, don't forget to #[code +1]!
+code.
doc = nlp(u'Credit and mortgage account holders must submit their requests')
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
span.merge()
for token in doc:
@ -199,11 +228,13 @@ p
| hook into some type of syntactic construction, just plug the sentence into
| the visualizer and see how spaCy annotates it.
+code.
+code-exec.
import spacy
from spacy import displacy
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
displacy.serve(doc, style='dep')
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
displacy.render(doc, style='dep', jupyter=True)
+infobox
| For more details and examples, see the

View File

@ -36,18 +36,21 @@ p
| #[code O] Token is outside an entity.#[br]
| #[code B] Token is the beginning of an entity.#[br]
+code("Example").
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'San Francisco considers banning sidewalk delivery robots')
# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
assert ents == [(u'San Francisco', 0, 13, u'GPE')]
print(ents)
# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
assert ent_san == [u'San', u'B', u'GPE']
assert ent_francisco == [u'Francisco', u'I', u'GPE']
print(ent_san) # [u'San', u'B', u'GPE']
print(ent_francisco) # [u'Francisco', u'I', u'GPE']
+table(["Text", "ent_iob", "ent_iob_", "ent_type_", "Description"])
- var style = [0, 1, 1, 1, 0]
@ -69,25 +72,30 @@ p
| to assign to the #[+api("doc#ents") #[code doc.ents]] attribute
| and create the new entity as a #[+api("span") #[code Span]].
+code("Example").
+code-exec.
import spacy
from spacy.tokens import Span
doc = nlp(u'Netflix is hiring a new VP of global policy')
# the model didn't recognise any entities :(
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"FB is hiring a new Vice President of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "FB" as an entity :(
ORG = doc.vocab.strings[u'ORG'] # get hash value of entity label
netflix_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
doc.ents = [netflix_ent]
fb_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
assert ents == [(u'Netflix', 0, 7, u'ORG')]
print('After', ents)
# [(u'FB', 0, 2, 'ORG')] 🎉
p
| Keep in mind that you need to create a #[code Span] with the start and
| end index of the #[strong token], not the start and end index of the
| entity in the document. In this case, "Netflix" is token #[code (0, 1)]
| entity in the document. In this case, "FB" is token #[code (0, 1)]
| but at the document level, the entity will have the start and end
| indices #[code (0, 7)].
| indices #[code (0, 2)].
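p
| If you only know character offsets, #[code Doc.char_span] is a handy
| alternative: it takes offsets into #[code doc.text] and returns the
| corresponding #[code Span]. A minimal sketch, assuming the same example
| sentence and small English model as above:

+code.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"FB is hiring a new Vice President of global policy")
# character offsets (0, 2) cover the same text as token indices (0, 1)
# (char_span returns None if the offsets don't map to token boundaries)
fb_ent = doc.char_span(0, 2, label=doc.vocab.strings[u'ORG'])
print(fb_ent.text, fb_ent.label_)  # FB ORG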
+h(4, "setting-from-array") Setting entity annotations from array
@ -97,19 +105,21 @@ p
| you should include both the #[code ENT_TYPE] and the #[code ENT_IOB]
| attributes in the array you're importing from.
+code.
+code-exec.
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE
nlp = spacy.load('en_core_web_sm')
doc = nlp.make_doc(u'London is a big city in the United Kingdom.')
assert list(doc.ents) == []
print('Before', list(doc.ents)) # []
header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)))
attr_array[0, 0] = 2 # B
attr_array[0, 0] = 3 # B
attr_array[0, 1] = doc.vocab.strings[u'GPE']
doc.from_array(header, attr_array)
assert list(doc.ents)[0].text == u'London'
print('After', list(doc.ents)) # [London]
+h(4, "setting-cython") Setting entity annotations in Cython

View File

@ -39,11 +39,11 @@ p
| callback function to invoke on a successful match. For now, we set it
| to #[code None].
+code.
+code-exec.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# add match ID "HelloWorld" with no callback and one pattern
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
@ -51,6 +51,10 @@ p
doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)
for match_id, start, end in matches:
string_id = nlp.vocab.strings[match_id] # get string representation
span = doc[start:end] # the matched span
print(match_id, string_id, start, end, span.text)
p
| The matcher returns a list of #[code (match_id, start, end)] tuples in
@ -195,11 +199,11 @@ p
| much more efficient overall. The #[code Doc] patterns can contain single
| or multiple tokens.
+code.
+code-exec.
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
patterns = [nlp(text) for text in terminology_list]
@ -208,6 +212,9 @@ p
doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
span = doc[start:end]
print(span.text)
p
| Since spaCy is used for processing both the patterns and the text to be
@ -228,11 +235,11 @@ p
| "Google I/O 2017". If your pattern matches, spaCy should execute your
| custom callback function #[code add_event_ent].
+code.
+code-exec.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# Get the ID of the 'EVENT' entity type. This is required to set an entity.
@ -242,12 +249,17 @@ p
# Get the current match and create tuple of entity label, start and end.
# Append entity to the doc's entity. (Don't overwrite doc.ents!)
match_id, start, end = matches[i]
doc.ents += ((EVENT, start, end),)
entity = (EVENT, start, end)
doc.ents += (entity,)
print(doc[start:end].text, entity)
matcher.add('GoogleIO', add_event_ent,
[{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
[{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}, {'IS_DIGIT': True}])
doc = nlp(u"This is a text about Google I/O 2015.")
matches = matcher(doc)
p
| In addition to mentions of "Google I/O", your data also contains some
| annoying pre-processing artefacts, like leftover HTML line breaks
@ -256,21 +268,33 @@ p
| can easily ignore them later. So you add a second pattern and pass in a
| function #[code merge_and_flag]:
+code.
# Add a new custom flag to the vocab, which is always False by default.
# BAD_HTML_FLAG will be the flag ID, which we can use to set it to True on the span.
BAD_HTML_FLAG = nlp.vocab.add_flag(lambda text: False)
+code-exec.
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# register a new token extension to flag bad HTML
Token.set_extension('bad_html', default=False, force=True)
def merge_and_flag(matcher, doc, i, matches):
match_id, start, end = matches[i]
span = doc[start : end]
span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
span.set_flag(BAD_HTML_FLAG, True) # set BAD_HTML_FLAG
for token in span:
token._.bad_html = True # mark token as bad HTML
print(span.text)
matcher.add('BAD_HTML', merge_and_flag,
[{'ORTH': '&lt;'}, {'LOWER': 'br'}, {'ORTH': '&gt;'}],
[{'ORTH': '&lt;'}, {'LOWER': 'br/'}, {'ORTH': '&gt;'}])
doc = nlp(u"Hello&lt;br&gt;world!")
matches = matcher(doc)
for token in doc:
print(token.text, token._.bad_html)
+aside("Tip: Visualizing matches")
| When working with entities, you can use #[+api("top-level#displacy") displaCy]
| to quickly generate a NER visualization from your updated #[code Doc],
@ -334,11 +358,11 @@ p
| use the #[+api("doc#char_span") #[code Doc.char_span]] method to
| create a #[code Span] from the character indices of the match:
+code.
+code-exec.
import spacy
import re
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')
@ -346,6 +370,7 @@ p
for match in re.finditer(DEFINITELY_PATTERN, doc.text):
start, end = match.span() # get matched indices
span = doc.char_span(start, end) # create Span from indices
print(span.text)
p
| You can also use the regular expression with spaCy's #[code Matcher] by
@ -357,13 +382,24 @@ p
| regular expression will return #[code True] for the #[code IS_DEFINITELY]
| flag.
+code.
+code-exec.
import spacy
from spacy.matcher import Matcher
import re
nlp = spacy.load('en_core_web_sm')
definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)
matcher = Matcher(nlp.vocab)
matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
matches = matcher(doc)
for match_id, start, end in matches:
span = doc[start:end]
print(span.text)
p
| Providing the regular expressions as binary flags also lets you use them
| in combination with other token patterns, for example, to match the
@ -407,11 +443,12 @@ p
| #[+a("/usage/visualizers#manual-usage") "manual" mode] lets you
| pass in a list of dictionaries containing the text and entities to render.
+code.
+code-exec.
import spacy
from spacy import displacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matched_sents = [] # collect data of matched sentences to be visualized
@ -430,11 +467,14 @@ p
pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
{'POS': 'ADJ'}]
matcher.add('FacebookIs', collect_sents, pattern) # add pattern
matches = matcher(nlp(LOTS_OF_TEXT)) # match on your text
doc = nlp(u"I'd say that Facebook is evil. Facebook is pretty cool, right?")
matches = matcher(doc)
# serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
displacy.serve(matched_sents, style='ent', manual=True)
# (if you're not running the code within a Jupyter environment, you can
# remove jupyter=True and use displacy.serve instead)
displacy.render(matched_sents, style='ent', manual=True, jupyter=True)
+h(3, "example2") Example: Phone numbers
@ -474,6 +514,23 @@ p
| extend, and doesn't require any training data, only a set of
| test cases.
+code-exec.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'ddd'},
{'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]
matcher.add('PHONE_NUMBER', None, pattern)
doc = nlp(u"Call me at (123) 456 789 or (123) 456 789!")
print([t.text for t in doc])
matches = matcher(doc)
for match_id, start, end in matches:
span = doc[start:end]
print(span.text)
+h(3, "example3") Example: Hashtags and emoji on social media
p
@ -509,7 +566,7 @@ p
| Valid hashtags usually consist of a #[code #], plus a sequence of
| ASCII characters with no whitespace, making them easy to match as well.
+code.
+code-exec.
from spacy.lang.en import English
from spacy.matcher import Matcher
@ -523,12 +580,27 @@ p
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]
# function to label the sentiment
def label_sentiment(matcher, doc, i, matches):
match_id, start, end = matches[i]
if doc.vocab.strings[match_id] == 'HAPPY': # don't forget to get string!
doc.sentiment += 0.1 # add 0.1 for positive sentiment
elif doc.vocab.strings[match_id] == 'SAD':
doc.sentiment -= 0.1 # subtract 0.1 for negative sentiment
matcher.add('HAPPY', label_sentiment, *pos_patterns) # add positive pattern
matcher.add('SAD', label_sentiment, *neg_patterns) # add negative pattern
# add pattern to merge valid hashtag, i.e. '#' plus any ASCII token
# add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])
doc = nlp(u"Hello world 😀 #MondayMotivation")
matches = matcher(doc)
for match_id, start, end in matches:
string_id = doc.vocab.strings[match_id] # look up string ID
span = doc[start:end]
print(string_id, span.text)
p
| Because the #[code on_match] callback receives the ID of each match, you
| can use the same function to handle the sentiment assignment for both
@ -562,26 +634,37 @@ p
span._.emoji_desc = emoji.title # assign emoji description
p
| To label the hashtags, we first need to add a new custom flag.
| #[code IS_HASHTAG] will be the flag's ID, which you can use to assign it
| to the hashtag's span, and check its value via a token's
| #[+api("token#check_flag") #[code check_flag()]] method. On each
| match, we merge the hashtag and assign the flag. Alternatively, we
| could also use a
| #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute],
| e.g. #[code token._.is_hashtag].
| To label the hashtags, we can use a
| #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
| set on the respective token:
+code.
# Add a new custom flag to the vocab, which is always False by default
IS_HASHTAG = nlp.vocab.add_flag(lambda text: False)
+code-exec.
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])
# register token extension
Token.set_extension('is_hashtag', default=False, force=True)
doc = nlp(u"Hello world 😀 #MondayMotivation")
matches = matcher(doc)
spans = []
hashtags = []
for match_id, start, end in matches:
spans.append(doc[start : end])
for span in spans:
span.merge() # merge hashtag
span.set_flag(IS_HASHTAG, True) # set IS_HASHTAG to True
if doc.vocab.strings[match_id] == 'HASHTAG':
hashtags.append(doc[start:end])
for span in hashtags:
span.merge()
for token in span:
token._.is_hashtag = True
for token in doc:
print(token.text, token._.is_hashtag)
p
| To process a stream of social media posts, we can use

View File

@ -52,20 +52,23 @@ p
| Here's how to add a special case rule to an existing
| #[+api("tokenizer") #[code Tokenizer]] instance:
+code.
+code-exec.
import spacy
from spacy.symbols import ORTH, LEMMA, POS, TAG
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'gimme that') # phrase to tokenize
assert [w.text for w in doc] == [u'gimme', u'that'] # current tokenization
print([w.text for w in doc]) # ['gimme', 'that']
# add special case rule
special_case = [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'}, {ORTH: u'me'}]
nlp.tokenizer.add_special_case(u'gimme', special_case)
assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
# check new tokenization
print([w.text for w in nlp(u'gimme that')]) # ['gim', 'me', 'that']
# Pronoun lemma is returned as -PRON-!
assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'-PRON-', u'that']
print([w.lemma_ for w in nlp(u'gimme that')]) # ['give', '-PRON-', 'that']
p
| For details on spaCy's custom pronoun lemma #[code -PRON-],
@ -197,8 +200,9 @@ p
| expression object, and pass its #[code .search()] and
| #[code .finditer()] methods:
+code.
import regex as re
+code-exec.
import re
import spacy
from spacy.tokenizer import Tokenizer
prefix_re = re.compile(r'''^[\[\(&quot;&apos;]''')
@ -212,8 +216,10 @@ p
infix_finditer=infix_re.finditer,
token_match=simple_url_re.match)
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world.")
print([t.text for t in doc])
p
| If you need to subclass the tokenizer instead, the relevant methods to
@ -287,7 +293,8 @@ p
| vocabulary object. Let's say we have the following class as
| our tokenizer:
+code.
+code-exec.
import spacy
from spacy.tokens import Doc
class WhitespaceTokenizer(object):
@ -300,6 +307,11 @@ p
spaces = [True] * len(words)
return Doc(self.vocab, words=words, spaces=spaces)
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(u"What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])
p
| As you can see, we need a #[code Vocab] instance to construct this — but
| we won't have it until we get back the loaded #[code nlp] object. The
@ -307,10 +319,6 @@ p
| that you can reuse the "tokenizer factory" and initialise it with
| different instances of #[code Vocab].
+code.
nlp = spacy.load('en')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
+h(3, "own-annotations") Bringing your own annotations
p
@ -322,8 +330,15 @@ p
| specify a list of boolean values, indicating whether each word has a
| subsequent space.
+code.
doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
+code-exec.
import spacy
from spacy.tokens import Doc
from spacy.lang.en import English
nlp = English()
doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
spaces=[False, True, False, False])
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])
p
| If provided, the spaces list must be the same length as the words list.
@ -332,11 +347,18 @@ p
| attributes. If you don't provide a #[code spaces] sequence, spaCy will
| assume that all words are whitespace delimited.
+code.
good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
+code-exec.
import spacy
from spacy.tokens import Doc
from spacy.lang.en import English
nlp = English()
bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
assert bad_spaces.text == u'Hello , world !'
assert good_spaces.text == u'Hello, world!'
good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
spaces=[False, True, False, False])
print(bad_spaces.text) # 'Hello , world !'
print(good_spaces.text) # 'Hello, world!'
p
| Once you have a #[+api("doc") #[code Doc]] object, you can write to its

View File

@ -147,10 +147,10 @@ p
| you can also #[code import] it and then call its #[code load()] method
| with no arguments:
+code.
import en_core_web_md
+code-exec.
import en_core_web_sm
nlp = en_core_web_md.load()
nlp = en_core_web_sm.load()
doc = nlp(u'This is a sentence.')
p

View File

@ -31,14 +31,16 @@ p
| #[strong custom name]. If no name is set and no #[code name] attribute
| is present on your component, the function name is used.
+code("Adding pipeline components").
+code-exec.
import spacy
def my_component(doc):
print("After tokenization, this doc has %s tokens." % len(doc))
if len(doc) &lt; 10:
print("This is a pretty short document.")
return doc
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(my_component, name='print_info', first=True)
print(nlp.pipe_names) # ['print_info', 'tagger', 'parser', 'ner']
doc = nlp(u"This is a sentence.")
@ -47,326 +49,82 @@ p
| Of course, you can also wrap your component as a class to allow
| initialising it with custom settings and hold state within the component.
| This is useful for #[strong stateful components], especially ones which
| #[strong depend on shared data].
| #[strong depend on shared data]. In the following example, the custom
| component #[code EntityMatcher] can be initialised with the #[code nlp] object,
| a terminology list and an entity label. Using the
| #[+api("phrasematcher") #[code PhraseMatcher]], it then matches the terms
| in the #[code Doc] and adds them to the existing entities.
+code.
class MyComponent(object):
name = 'print_info'
+aside("Rule-based entities vs. model", "💡")
| For complex tasks, it's usually better to train a statistical entity
| recognition model. However, statistical models require training data, so
| for many situations, rule-based approaches are more practical. This is
| especially true at the start of a project: you can use a rule-based
| approach as part of a data collection process, to help you "bootstrap" a
| statistical model.
def __init__(self, vocab, short_limit=10):
self.vocab = vocab
self.short_limit = short_limit
+code-exec.
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
class EntityMatcher(object):
name = 'entity_matcher'
def __init__(self, nlp, terms, label):
patterns = [nlp(text) for text in terms]
self.matcher = PhraseMatcher(nlp.vocab)
self.matcher.add(label, None, *patterns)
def __call__(self, doc):
if len(doc) &lt; self.short_limit:
print("This is a pretty short document.")
matches = self.matcher(doc)
for match_id, start, end in matches:
span = Span(doc, start, end, label=match_id)
doc.ents = list(doc.ents) + [span]
return doc
my_component = MyComponent(nlp.vocab, short_limit=25)
nlp.add_pipe(my_component, first=True)
nlp = spacy.load('en_core_web_sm')
terms = (u'cat', u'dog', u'tree kangaroo', u'giant sea spider')
entity_matcher = EntityMatcher(nlp, terms, 'ANIMAL')
+h(3, "custom-components-attributes")
| Extension attributes on #[code Doc], #[code Span] and #[code Token]
+tag-new(2)
nlp.add_pipe(entity_matcher, after='ner')
print(nlp.pipe_names) # the components in the pipeline
doc = nlp(u"This is a text about Barack Obama and a tree kangaroo")
print([(ent.text, ent.label_) for ent in doc.ents])
+h(3, "custom-components-factories") Adding factories
p
| As of v2.0, spaCy allows you to set any custom attributes and methods
| on the #[code Doc], #[code Span] and #[code Token], which become
| available as #[code Doc._], #[code Span._] and #[code Token._] for
| example, #[code Token._.my_attr]. This lets you store additional
| information relevant to your application, add new features and
| functionality to spaCy, and implement your own models trained with other
| machine learning libraries. It also lets you take advantage of spaCy's
| data structures and the #[code Doc] object as the "single source of
| truth".
+aside("Why ._?")
| Writing to a #[code ._] attribute instead of to the #[code Doc] directly
| keeps a clearer separation and makes it easier to ensure backwards
| compatibility. For example, if you've implemented your own #[code .coref]
| property and spaCy claims it one day, it'll break your code. Similarly,
| just by looking at the code, you'll immediately know what's built-in and
| what's custom for example, #[code doc.sentiment] is spaCy, while
| #[code doc._.sent_score] isn't.
p
| There are three main types of extensions, which can be defined using the
| #[+api("doc#set_extension") #[code Doc.set_extension]],
| #[+api("span#set_extension") #[code Span.set_extension]] and
| #[+api("token#set_extension") #[code Token.set_extension]] methods.
+list("numbers")
+item #[strong Attribute extensions].
| Set a default value for an attribute, which can be overwritten
| manually at any time. Attribute extensions work like "normal"
| variables and are the quickest way to store arbitrary information
| on a #[code Doc], #[code Span] or #[code Token]. Attribute defaults
| behaves just like argument defaults
| #[+a("http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments") in Python functions],
| and should not be used for mutable values like dictionaries or lists.
+code-wrapper
+code.
Doc.set_extension('hello', default=True)
assert doc._.hello
doc._.hello = False
+item #[strong Property extensions].
| Define a getter and an optional setter function. If no setter is
| provided, the extension is immutable. Since the getter and setter
| functions are only called when you #[em retrieve] the attribute,
| you can also access values of previously added attribute extensions.
| For example, a #[code Doc] getter can average over #[code Token]
| attributes. For #[code Span] extensions, you'll almost always want
| to use a property otherwise, you'd have to write to
| #[em every possible] #[code Span] in the #[code Doc] to set up the
| values correctly.
+code-wrapper
+code.
Doc.set_extension('hello', getter=get_hello_value, setter=set_hello_value)
assert doc._.hello
doc._.hello = 'Hi!'
+item #[strong Method extensions].
| Assign a function that becomes available as an object method. Method
| extensions are always immutable. For more details and implementation
| ideas, see
| #[+a("/usage/examples#custom-components-attr-methods") these examples].
+code-wrapper
+code.o-no-block.
Doc.set_extension('hello', method=lambda doc, name: 'Hi {}!'.format(name))
assert doc._.hello('Bob') == 'Hi Bob!'
p
| Before you can access a custom extension, you need to register it using
| the #[code set_extension] method on the object you want
| to add it to, e.g. the #[code Doc]. Keep in mind that extensions are
| always #[strong added globally] and not just on a particular instance.
| If an attribute of the same name
| already exists, or if you're trying to access an attribute that hasn't
| been registered, spaCy will raise an #[code AttributeError].
+code("Example").
from spacy.tokens import Doc, Span, Token
fruits = ['apple', 'pear', 'banana', 'orange', 'strawberry']
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])
Token.set_extension('is_fruit', getter=is_fruit_getter)
Doc.set_extension('has_fruit', getter=has_fruit_getter)
Span.set_extension('has_fruit', getter=has_fruit_getter)
+aside-code("Usage example").
doc = nlp(u"I have an apple and a melon")
assert doc[3]._.is_fruit # get Token attributes
assert not doc[0]._.is_fruit
assert doc._.has_fruit # get Doc attributes
assert doc[1:4]._.has_fruit # get Span attributes
p
| Once you've registered your custom attribute, you can also use the
| built-in #[code set], #[code get] and #[code has] methods to modify and
| retrieve the attributes. This is especially useful if you want to pass in
| a string instead of calling #[code doc._.my_attr], as shown in the short
| example after the table.
+table(["Method", "Description", "Valid for", "Example"])
+row
+cell #[code ._.set()]
+cell Set a value for an attribute.
+cell Attributes, mutable properties.
+cell #[code.u-break token._.set('my_attr', True)]
+row
+cell #[code ._.get()]
+cell Get the value of an attribute.
+cell Attributes, mutable properties, immutable properties, methods.
+cell #[code.u-break my_attr = span._.get('my_attr')]
+row
+cell #[code ._.has()]
+cell Check if an attribute exists.
+cell Attributes, mutable properties, immutable properties, methods.
+cell #[code.u-break doc._.has('my_attr')]
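p
| For instance, here's a short sketch of these helpers, assuming a blank
| English pipeline and a hypothetical #[code 'my_attr'] extension:

+code.
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
Doc.set_extension('my_attr', default=False, force=True)
doc = nlp(u"This is a sentence.")
doc._.set('my_attr', True)   # equivalent to: doc._.my_attr = True
print(doc._.get('my_attr'))  # True
print(doc._.has('my_attr'))  # True, i.e. the extension is registered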
+infobox("How the ._ is implemented")
| Extension definitions (the defaults, methods, getters and setters you
| pass in to #[code set_extension]) are stored in class attributes on the
| #[code Underscore] class. If you write to an extension attribute, e.g.
| #[code doc._.hello = True], the data is stored within the
| #[+api("doc#attributes") #[code Doc.user_data]] dictionary. To keep the
| underscore data separate from your other dictionary entries, the string
| #[code "._."] is placed before the name, in a tuple.
+h(4, "component-example1") Example: Custom sentence segmentation logic
p
| Let's say you want to implement custom logic to improve spaCy's sentence
| boundary detection. Currently, sentence segmentation is based on the
| dependency parse, which doesn't always produce ideal results. The custom
| logic should therefore be applied #[strong after] tokenization, but
| #[strong before] the dependency parsing: this way, the parser can also
| take advantage of the sentence boundaries.
| When spaCy loads a model via its #[code meta.json], it will iterate over
| the #[code "pipeline"] setting, look up every component name in the
| internal factories and call
| #[+api("language#create_pipe") #[code nlp.create_pipe]] to initialise the
| individual components, like the tagger, parser or entity recogniser. If
| your model uses custom components, this won't work, so you'll have to
| tell spaCy #[strong where to find your component]. You can do this by
| writing to the #[code Language.factories]:
+code.
def sbd_component(doc):
for i, token in enumerate(doc[:-2]):
# define sentence start if period + titlecase token
if token.text == '.' and doc[i+1].is_title:
doc[i+1].sent_start = True
return doc
nlp = spacy.load('en')
nlp.add_pipe(sbd_component, before='parser') # insert before the parser
+h(4, "component-example2")
| Example: Pipeline component for entity matching and tagging with
| custom attributes
from spacy.language import Language
Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
p
| This example shows how to create a spaCy extension that takes a
| terminology list (in this case, single- and multi-word company names),
| matches the occurrences in a document, labels them as #[code ORG] entities,
| merges the tokens and sets custom #[code is_tech_org] and
| #[code has_tech_org] attributes. For efficient matching, the example uses
| the #[+api("phrasematcher") #[code PhraseMatcher]] which accepts
| #[code Doc] objects as match patterns and works well for large
| terminology lists. It also ensures your patterns will always match, even
| when you customise spaCy's tokenization rules. When you call #[code nlp]
| on a text, the custom pipeline component is applied to the #[code Doc]
| You can also ship the above code and your custom component in your
| packaged model's #[code __init__.py], so it's executed when you load your
| model. The #[code **cfg] config parameters are passed all the way down
| from #[+api("spacy#load") #[code spacy.load]], so you can load the model
| and its components with custom settings:
+github("spacy", "examples/pipeline/custom_component_entities.py", 500)
+code.
nlp = spacy.load('your_custom_model', terms=(u'tree kangaroo',), label='ANIMAL')
p
| Wrapping this functionality in a
| pipeline component allows you to reuse the module with different
| settings, and have all pre-processing taken care of when you call
| #[code nlp] on your text and receive a #[code Doc] object.
+h(4, "component-example3")
| Example: Pipeline component for GPE entities and country meta data via a
| REST API
p
| This example shows the implementation of a pipeline component
| that fetches country meta data via the
| #[+a("https://restcountries.eu") REST Countries API] sets entity
| annotations for countries, merges entities into one token and
| sets custom attributes on the #[code Doc], #[code Span] and
| #[code Token], for example, the capital, latitude/longitude coordinates
| and even the country flag.
+github("spacy", "examples/pipeline/custom_component_countries_api.py", 500)
p
| In this case, all data can be fetched on initialisation in one request.
| However, if you're working with text that contains incomplete country
| names, spelling mistakes or foreign-language versions, you could also
| implement a #[code like_country]-style getter function that makes a
| request to the search API endpoint and returns the best-matching
| result.
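p
| As a rough sketch, such a getter could be registered as a #[code Token]
| extension. The endpoint, response handling and attribute name below are
| assumptions for illustration, not part of spaCy:

+code.
import requests
from spacy.tokens import Token

def like_country_getter(token):
    # query the (assumed) REST Countries search endpoint for the token's text
    r = requests.get('https://restcountries.eu/rest/v2/name/' + token.text)
    if r.status_code == 200 and r.json():
        return r.json()[0].get('name')  # best-matching country name
    return None  # no match found

Token.set_extension('like_country', getter=like_country_getter, force=True)
# usage (requires a loaded pipeline and network access):
# doc = nlp(u"I'm visiting Sweden next week.")
# print(doc[3]._.like_country)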
+h(4, "custom-components-usage-ideas") Other usage ideas
+list
+item
| #[strong Adding new features and hooking in models]. For example,
| a sentiment analysis model, or your preferred solution for
| lemmatization or sentiment analysis. spaCy's built-in tagger,
| parser and entity recognizer respect annotations that were already
| set on the #[code Doc] in a previous step of the pipeline.
+item
| #[strong Integrating other libraries and APIs]. For example, your
| pipeline component can write additional information and data
| directly to the #[code Doc] or #[code Token] as custom attributes,
| while making sure no information is lost in the process. This can
| be output generated by other libraries and models, or an external
| service with a REST API.
+item
| #[strong Debugging and logging]. For example, a component which
| stores and/or exports relevant information about the current state
| of the processed document, which you can insert at any point of your
| pipeline.
+infobox("Developing third-party extensions")
| The new pipeline management and custom attributes finally make it easy
| to develop your own spaCy extensions and plugins and share them with
| others. Extensions can claim their own #[code ._] namespace and exist as
| standalone packages. If you're developing a tool or library and want to
| make it easy for others to use it with spaCy and add it to their
| pipeline, all you have to do is expose a function that takes a
| #[code Doc], modifies it and returns it. For more details and
| #[strong best practices], see the section on
| #[+a("#extensions") developing spaCy extensions].
+h(3, "custom-components-user-hooks") User hooks
p
| While it's generally recommended to use the #[code Doc._], #[code Span._]
| and #[code Token._] proxies to add your own custom attributes, spaCy
| offers a few exceptions to allow #[strong customising the built-in methods]
| like #[+api("doc#similarity") #[code Doc.similarity]] or
| #[+api("doc#vector") #[code Doc.vector]]. with your own hooks, which can
| rely on statistical models you train yourself. For instance, you can
| provide your own on-the-fly sentence segmentation algorithm or document
| similarity method.
p
| Hooks let you customize some of the behaviours of the #[code Doc],
| #[code Span] or #[code Token] objects by adding a component to the
| pipeline. For instance, to customize the
| #[+api("doc#similarity") #[code Doc.similarity]] method, you can add a
| component that sets a custom function to
| #[code doc.user_hooks['similarity']]. The built-in #[code Doc.similarity]
| method will check the #[code user_hooks] dict, and delegate to your
| function if you've set one. Similar results can be achieved by setting
| functions to #[code Doc.user_span_hooks] and #[code Doc.user_token_hooks].
+aside("Implementation note")
| The hooks live on the #[code Doc] object because the #[code Span] and
| #[code Token] objects are created lazily, and don't own any data. They
| just proxy to their parent #[code Doc]. This turns out to be convenient
| here — we only have to worry about installing hooks in one place.
+table(["Name", "Customises"])
+row
+cell #[code user_hooks]
+cell
+api("doc#vector") #[code Doc.vector]
+api("doc#has_vector") #[code Doc.has_vector]
+api("doc#vector_norm") #[code Doc.vector_norm]
+api("doc#sents") #[code Doc.sents]
+row
+cell #[code user_token_hooks]
+cell
+api("token#similarity") #[code Token.similarity]
+api("token#vector") #[code Token.vector]
+api("token#has_vector") #[code Token.has_vector]
+api("token#vector_norm") #[code Token.vector_norm]
+api("token#conjuncts") #[code Token.conjuncts]
+row
+cell #[code user_span_hooks]
+cell
+api("span#similarity") #[code Span.similarity]
+api("span#vector") #[code Span.vector]
+api("span#has_vector") #[code Span.has_vector]
+api("span#vector_norm") #[code Span.vector_norm]
+api("span#root") #[code Span.root]
+code("Add custom similarity hooks").
class SimilarityModel(object):
def __init__(self, model):
self._model = model
def __call__(self, doc):
doc.user_hooks['similarity'] = self.similarity
doc.user_span_hooks['similarity'] = self.similarity
doc.user_token_hooks['similarity'] = self.similarity
def similarity(self, obj1, obj2):
y = self._model([obj1.vector, obj2.vector])
return float(y[0])
+infobox("Important note", "⚠️")
| When you load a model via its shortcut or package name, like
| #[code en_core_web_sm], spaCy will import the package and then call its
| #[code load()] method. This means that custom code in the model's
| #[code __init__.py] will be executed, too. This is #[strong not the case]
| if you're loading a model from a path containing the model data. Here,
| spaCy will only read in the #[code meta.json]. If you want to use custom
| factories with a model loaded from a path, you need to add them to
| #[code Language.factories] #[em before] you load the model.
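p
| As a minimal sketch of what this could look like in practice (the
| #[code 'my_component'] factory name, the component function and the model
| path below are hypothetical placeholders):
+code("Registering a factory before loading from a path").
    import spacy
    from spacy.language import Language

    def my_component(doc):
        # placeholder component that simply passes the Doc through
        return doc

    # register the factory *before* loading, so spaCy can resolve the name
    # listed in the model package's meta.json
    Language.factories['my_component'] = lambda nlp, **cfg: my_component
    nlp = spacy.load('/path/to/model')  # hypothetical path to the model data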

View File

@ -1,11 +1,320 @@
//- 💫 DOCS > USAGE > PROCESSING PIPELINES > DEVELOPING EXTENSIONS
//- 💫 DOCS > USAGE > PROCESSING PIPELINES > EXTENSIONS
p
| As of v2.0, spaCy allows you to set any custom attributes and methods
| on the #[code Doc], #[code Span] and #[code Token], which become
| available as #[code Doc._], #[code Span._] and #[code Token._], for
| example #[code Token._.my_attr]. This lets you store additional
| information relevant to your application, add new features and
| functionality to spaCy, and implement your own models trained with other
| machine learning libraries. It also lets you take advantage of spaCy's
| data structures and the #[code Doc] object as the "single source of
| truth".
+aside("Why ._?")
| Writing to a #[code ._] attribute instead of to the #[code Doc] directly
| keeps a clearer separation and makes it easier to ensure backwards
| compatibility. For example, if you've implemented your own #[code .coref]
| property and spaCy claims it one day, it'll break your code. Similarly,
| just by looking at the code, you'll immediately know what's built-in and
| what's custom: for example, #[code doc.sentiment] is spaCy, while
| #[code doc._.sent_score] isn't.
p
| There are three main types of extensions, which can be defined using the
| #[+api("doc#set_extension") #[code Doc.set_extension]],
| #[+api("span#set_extension") #[code Span.set_extension]] and
| #[+api("token#set_extension") #[code Token.set_extension]] methods.
+list("numbers")
+item #[strong Attribute extensions].
| Set a default value for an attribute, which can be overwritten
| manually at any time. Attribute extensions work like "normal"
| variables and are the quickest way to store arbitrary information
| on a #[code Doc], #[code Span] or #[code Token]. Attribute defaults
| behave just like argument defaults
| #[+a("http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments") in Python functions],
| and should not be used for mutable values like dictionaries or lists
| (see the short sketch after this list).
+code-wrapper
+code.
Doc.set_extension('hello', default=True)
assert doc._.hello
doc._.hello = False
+item #[strong Property extensions].
| Define a getter and an optional setter function. If no setter is
| provided, the extension is immutable. Since the getter and setter
| functions are only called when you #[em retrieve] the attribute,
| you can also access values of previously added attribute extensions.
| For example, a #[code Doc] getter can average over #[code Token]
| attributes. For #[code Span] extensions, you'll almost always want
| to use a property; otherwise, you'd have to write to
| #[em every possible] #[code Span] in the #[code Doc] to set up the
| values correctly.
+code-wrapper
+code.
Doc.set_extension('hello', getter=get_hello_value, setter=set_hello_value)
assert doc._.hello
doc._.hello = 'Hi!'
+item #[strong Method extensions].
| Assign a function that becomes available as an object method. Method
| extensions are always immutable. For more details and implementation
| ideas, see
| #[+a("/usage/examples#custom-components-attr-methods") these examples].
+code-wrapper
+code.o-no-block.
Doc.set_extension('hello', method=lambda doc, name: 'Hi {}!'.format(name))
assert doc._.hello('Bob') == 'Hi Bob!'
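p
| The short sketch below (using a hypothetical #[code 'notes'] attribute)
| shows why mutable defaults are problematic: the default object is created
| once when the extension is registered and then shared by every #[code Doc].
+code("Why mutable defaults are shared (sketch)").
    from spacy.lang.en import English
    from spacy.tokens import Doc

    nlp = English()
    Doc.set_extension('notes', default=[])   # one list, shared by *all* Docs

    doc1 = nlp(u'First document')
    doc2 = nlp(u'Second document')
    doc1._.notes.append('only meant for doc1')
    print(doc2._.notes)   # ['only meant for doc1'] (the default list is shared!)
    # safer: assign a fresh list per Doc, or use a getter instead
    doc2._.notes = []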
p
| Before you can access a custom extension, you need to register it using
| the #[code set_extension] method on the object you want
| to add it to, e.g. the #[code Doc]. Keep in mind that extensions are
| always #[strong added globally] and not just on a particular instance.
| If an attribute of the same name
| already exists, or if you're trying to access an attribute that hasn't
| been registered, spaCy will raise an #[code AttributeError].
+code("Example").
from spacy.tokens import Doc, Span, Token
fruits = ['apple', 'pear', 'banana', 'orange', 'strawberry']
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])
Token.set_extension('is_fruit', getter=is_fruit_getter)
Doc.set_extension('has_fruit', getter=has_fruit_getter)
Span.set_extension('has_fruit', getter=has_fruit_getter)
+aside-code("Usage example").
doc = nlp(u"I have an apple and a melon")
assert doc[3]._.is_fruit # get Token attributes
assert not doc[0]._.is_fruit
assert doc._.has_fruit # get Doc attributes
assert doc[1:4]._.has_fruit # get Span attributes
p
| Once you've registered your custom attribute, you can also use the
| built-in #[code set], #[code get] and #[code has] methods to modify and
| retrieve the attributes. This is especially useful if you want to pass in
| a string instead of calling #[code doc._.my_attr].
+table(["Method", "Description", "Valid for", "Example"])
+row
+cell #[code ._.set()]
+cell Set a value for an attribute.
+cell Attributes, mutable properties.
+cell #[code.u-break token._.set('my_attr', True)]
+row
+cell #[code ._.get()]
+cell Get the value of an attribute.
+cell Attributes, mutable properties, immutable properties, methods.
+cell #[code.u-break my_attr = span._.get('my_attr')]
+row
+cell #[code ._.has()]
+cell Check if an attribute exists.
+cell Attributes, mutable properties, immutable properties, methods.
+cell #[code.u-break doc._.has('my_attr')]
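p
| For example, the attribute name can come from a variable or a config file
| instead of being hard-coded. The #[code 'is_reviewed'] attribute below is
| just a made-up example:
+code("Using ._.set, ._.get and ._.has (sketch)").
    from spacy.lang.en import English
    from spacy.tokens import Doc

    Doc.set_extension('is_reviewed', default=False)
    nlp = English()
    doc = nlp(u'This is a text.')

    attr_name = 'is_reviewed'            # e.g. read from a config file
    doc._.set(attr_name, True)           # same as doc._.is_reviewed = True
    print(doc._.get(attr_name))          # True
    print(doc._.has('other_attr'))       # False, not registered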
+infobox("How the ._ is implemented")
| Extension definitions (the defaults, methods, getters and setters you
| pass in to #[code set_extension]) are stored in class attributes on the
| #[code Underscore] class. If you write to an extension attribute, e.g.
| #[code doc._.hello = True], the data is stored within the
| #[+api("doc#attributes") #[code Doc.user_data]] dictionary. To keep the
| underscore data separate from your other dictionary entries, the string
| #[code "._."] is placed before the name, in a tuple.
+h(4, "component-example1") Example: Custom sentence segmentation logic
p
| Let's say you want to implement custom logic to improve spaCy's sentence
| boundary detection. Currently, sentence segmentation is based on the
| dependency parse, which doesn't always produce ideal results. The custom
| logic should therefore be applied #[strong after] tokenization, but
| #[strong before] the dependency parsing. This way, the parser can also
| take advantage of the sentence boundaries.
+code-exec.
import spacy
def sbd_component(doc):
for i, token in enumerate(doc[:-2]):
# define sentence start if period + titlecase token
if token.text == '.' and doc[i+1].is_title:
doc[i+1].sent_start = True
return doc
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(sbd_component, before='parser') # insert before the parser
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
print(sent.text)
+h(4, "component-example2")
| Example: Pipeline component for entity matching and tagging with
| custom attributes
p
| This example shows how to create a spaCy extension that takes a
| terminology list (in this case, single- and multi-word company names),
| matches the occurrences in a document, labels them as #[code ORG] entities,
| merges the tokens and sets custom #[code is_tech_org] and
| #[code has_tech_org] attributes. For efficient matching, the example uses
| the #[+api("phrasematcher") #[code PhraseMatcher]] which accepts
| #[code Doc] objects as match patterns and works well for large
| terminology lists. It also ensures your patterns will always match, even
| when you customise spaCy's tokenization rules. When you call #[code nlp]
| on a text, the custom pipeline component is applied to the #[code Doc].
+github("spacy", "examples/pipeline/custom_component_entities.py", 500)
p
| Wrapping this functionality in a
| pipeline component allows you to reuse the module with different
| settings, and have all pre-processing taken care of when you call
| #[code nlp] on your text and receive a #[code Doc] object.
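p
| Below is a compact sketch of such a reusable component. The class name,
| settings and example companies are made up for illustration; the full
| example above shows the complete version with custom attributes.
+code("A component with configurable settings (sketch)").
    from spacy.lang.en import English
    from spacy.matcher import PhraseMatcher
    from spacy.tokens import Span

    class TechOrgMatcher(object):
        """Reusable component: the terminology list and entity label are
        settings passed in on initialisation."""
        name = 'tech_orgs'

        def __init__(self, nlp, companies=tuple(), label='ORG'):
            self.label = nlp.vocab.strings[label]     # store the label hash
            self.matcher = PhraseMatcher(nlp.vocab)
            patterns = [nlp.make_doc(text) for text in companies]
            self.matcher.add('TECH_ORGS', None, *patterns)

        def __call__(self, doc):
            matches = self.matcher(doc)
            spans = [Span(doc, start, end, label=self.label)
                     for match_id, start, end in matches]
            doc.ents = list(doc.ents) + spans
            return doc

    nlp = English()
    component = TechOrgMatcher(nlp, companies=['Alphabet Inc.', 'Tesla'])
    nlp.add_pipe(component)
    doc = nlp(u"Alphabet Inc. is the company behind Google.")
    print([(ent.text, ent.label_) for ent in doc.ents])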
+h(4, "component-example3")
| Example: Pipeline component for GPE entities and country meta data via a
| REST API
p
| This example shows the implementation of a pipeline component
| that fetches country meta data via the
| #[+a("https://restcountries.eu") REST Countries API] sets entity
| annotations for countries, merges entities into one token and
| sets custom attributes on the #[code Doc], #[code Span] and
| #[code Token], for example the capital, latitude/longitude coordinates
| and even the country flag.
+github("spacy", "examples/pipeline/custom_component_countries_api.py", 500)
p
| In this case, all data can be fetched on initialisation in one request.
| However, if you're working with text that contains incomplete country
| names, spelling mistakes or foreign-language versions, you could also
| implement a #[code like_country]-style getter function that makes a
| request to the search API endpoint and returns the best-matching
| result.
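p
| A rough sketch of such a getter is shown below. The endpoint, response
| handling and attribute name are assumptions for illustration; real code
| would also want caching, timeouts and error handling.
+code("A like_country-style getter (sketch)").
    import requests
    from spacy.tokens import Token

    def like_country_getter(token):
        # query the API's name search and return True if anything matches
        # (assumed endpoint; every access makes a request, so cache in practice)
        url = 'https://restcountries.eu/rest/v2/name/{}'.format(token.text)
        response = requests.get(url, timeout=3)
        return response.status_code == 200 and len(response.json()) > 0

    Token.set_extension('like_country', getter=like_country_getter)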
+h(4, "custom-components-usage-ideas") Other usage ideas
+list
+item
| #[strong Adding new features and hooking in models]. For example,
| a sentiment analysis model, or your preferred solution for
| lemmatization or sentiment analysis. spaCy's built-in tagger,
| parser and entity recognizer respect annotations that were already
| set on the #[code Doc] in a previous step of the pipeline.
+item
| #[strong Integrating other libraries and APIs]. For example, your
| pipeline component can write additional information and data
| directly to the #[code Doc] or #[code Token] as custom attributes,
| while making sure no information is lost in the process. This can
| be output generated by other libraries and models, or an external
| service with a REST API.
+item
| #[strong Debugging and logging]. For example, a component which
| stores and/or exports relevant information about the current state
| of the processed document, and can be inserted at any point of your
| pipeline.
+infobox("Developing third-party extensions")
| The new pipeline management and custom attributes finally make it easy
| to develop your own spaCy extensions and plugins and share them with
| others. Extensions can claim their own #[code ._] namespace and exist as
| standalone packages. If you're developing a tool or library and want to
| make it easy for others to use it with spaCy and add it to their
| pipeline, all you have to do is expose a function that takes a
| #[code Doc], modifies it and returns it. For more details and
| #[strong best practices], see the section on
| #[+a("#extensions") developing spaCy extensions].
+h(3, "custom-components-user-hooks") User hooks
p
| While it's generally recommended to use the #[code Doc._], #[code Span._]
| and #[code Token._] proxies to add your own custom attributes, spaCy
| offers a few exceptions to allow #[strong customising the built-in methods]
| like #[+api("doc#similarity") #[code Doc.similarity]] or
| #[+api("doc#vector") #[code Doc.vector]]. with your own hooks, which can
| rely on statistical models you train yourself. For instance, you can
| provide your own on-the-fly sentence segmentation algorithm or document
| similarity method.
p
| Hooks let you customize some of the behaviours of the #[code Doc],
| #[code Span] or #[code Token] objects by adding a component to the
| pipeline. For instance, to customize the
| #[+api("doc#similarity") #[code Doc.similarity]] method, you can add a
| component that sets a custom function to
| #[code doc.user_hooks['similarity']]. The built-in #[code Doc.similarity]
| method will check the #[code user_hooks] dict, and delegate to your
| function if you've set one. Similar results can be achieved by setting
| functions to #[code Doc.user_span_hooks] and #[code Doc.user_token_hooks].
+aside("Implementation note")
| The hooks live on the #[code Doc] object because the #[code Span] and
| #[code Token] objects are created lazily, and don't own any data. They
| just proxy to their parent #[code Doc]. This turns out to be convenient
| here — we only have to worry about installing hooks in one place.
+table(["Name", "Customises"])
+row
+cell #[code user_hooks]
+cell
+api("doc#vector") #[code Doc.vector]
+api("doc#has_vector") #[code Doc.has_vector]
+api("doc#vector_norm") #[code Doc.vector_norm]
+api("doc#sents") #[code Doc.sents]
+row
+cell #[code user_token_hooks]
+cell
+api("token#similarity") #[code Token.similarity]
+api("token#vector") #[code Token.vector]
+api("token#has_vector") #[code Token.has_vector]
+api("token#vector_norm") #[code Token.vector_norm]
+api("token#conjuncts") #[code Token.conjuncts]
+row
+cell #[code user_span_hooks]
+cell
+api("span#similarity") #[code Span.similarity]
+api("span#vector") #[code Span.vector]
+api("span#has_vector") #[code Span.has_vector]
+api("span#vector_norm") #[code Span.vector_norm]
+api("span#root") #[code Span.root]
+code("Add custom similarity hooks").
class SimilarityModel(object):
def __init__(self, model):
self._model = model
def __call__(self, doc):
doc.user_hooks['similarity'] = self.similarity
doc.user_span_hooks['similarity'] = self.similarity
doc.user_token_hooks['similarity'] = self.similarity
def similarity(self, obj1, obj2):
y = self._model([obj1.vector, obj2.vector])
return float(y[0])
+h(3, "extensions") Developing spaCy extensions
p
| We're very excited about all the new possibilities for community
| extensions and plugins in spaCy v2.0, and we can't wait to see what
| you build with it! To get you started, here are a few tips, tricks and
| best practices. For examples of other spaCy extensions, see the
| #[+a("/usage/resources#extensions") resources].
| best practices. #[+a("/universe/?category=pipeline") See here] for
| examples of other spaCy extensions.
+list
+item

View File

@ -108,7 +108,7 @@ p
| project on Twitter, don't forget to tag
| #[+a("https://twitter.com/" + SOCIAL.twitter) @#{SOCIAL.twitter}] so we
| don't miss it. If you think your project would be a good fit for the
| #[+a("/usage/resources") resources], #[strong feel free to submit it!]
| #[+a("/universe") spaCy Universe], #[strong feel free to submit it!]
| Tutorials are also incredibly valuable to other users and a great way to
| get exposure. So we strongly encourage #[strong writing up your experiences],
| or sharing your code and some tips and tricks on your blog. Since our

View File

@ -7,16 +7,18 @@ p
+h(3, "lightning-tour-models") Install models and process text
+code(false, "bash").
python -m spacy download en
python -m spacy download de
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
+code.
+code-exec.
import spacy
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Hello, world. Here are two sentences.')
print([t.text for t in doc])
nlp_de = spacy.load('de')
nlp_de = spacy.load('de_core_news_sm')
doc_de = nlp_de(u'Ich bin ein Berliner.')
print([t.text for t in doc_de])
+infobox
| #[+label-inline API:] #[+api("spacy#load") #[code spacy.load()]]
@ -26,19 +28,23 @@ p
+h(3, "lightning-tour-tokens-sentences") Get tokens, noun chunks & sentences
+tag-model("dependency parse")
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Peach emoji is where it has always been. Peach is the superior "
u"emoji. It's outranking eggplant 🍑 ")
print(doc[0].text) # Peach
print(doc[1].text) # emoji
print(doc[-1].text) # 🍑
print(doc[17:19].text) # outranking eggplant
assert doc[0].text == u'Peach'
assert doc[1].text == u'emoji'
assert doc[-1].text == u'🍑'
assert doc[17:19].text == u'outranking eggplant'
assert list(doc.noun_chunks)[0].text == u'Peach emoji'
noun_chunks = list(doc.noun_chunks)
print(noun_chunks[0].text) # Peach emoji
sentences = list(doc.sents)
assert len(sentences) == 3
assert sentences[1].text == u'Peach is the superior emoji.'
print(sentences[1].text) # 'Peach is the superior emoji.'
+infobox
| #[+label-inline API:] #[+api("doc") #[code Doc]], #[+api("token") #[code Token]]
@ -47,19 +53,22 @@ p
+h(3, "lightning-tour-pos-tags") Get part-of-speech tags and flags
+tag-model("tagger")
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
apple = doc[0]
assert [apple.pos_, apple.pos] == [u'PROPN', 17049293600679659579]
assert [apple.tag_, apple.tag] == [u'NNP', 15794550382381185553]
assert [apple.shape_, apple.shape] == [u'Xxxxx', 16072095006890171862]
assert apple.is_alpha == True
assert apple.is_punct == False
print('Coarse-grained POS tag', apple.pos_, apple.pos)
print('Fine-grained POS tag', apple.tag_, apple.tag)
print('Word shape', apple.shape_, apple.shape)
print('Alphabetic characters?', apple.is_alpha)
print('Punctuation mark?', apple.is_punct)
billion = doc[10]
assert billion.is_digit == False
assert billion.like_num == True
assert billion.like_email == False
print('Digit?', billion.is_digit)
print('Like a number?', billion.like_num)
print('Like an email address?', billion.like_email)
+infobox
| #[+label-inline API:] #[+api("token") #[code Token]]
@ -67,19 +76,25 @@ p
+h(3, "lightning-tour-hashes") Use hash values for any string
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
coffee_hash = nlp.vocab.strings[u'coffee'] # 3197928453018144401
coffee_text = nlp.vocab.strings[coffee_hash] # 'coffee'
assert doc[2].orth == coffee_hash == 3197928453018144401
assert doc[2].text == coffee_text == u'coffee'
print(coffee_hash, coffee_text)
print(doc[2].orth, coffee_hash) # 3197928453018144401
print(doc[2].text, coffee_text) # 'coffee'
beer_hash = doc.vocab.strings.add(u'beer') # 3073001599257881079
beer_text = doc.vocab.strings[beer_hash] # 'beer'
print(beer_hash, beer_text)
unicorn_hash = doc.vocab.strings.add(u'🦄 ') # 18234233413267120783
unicorn_text = doc.vocab.strings[unicorn_hash] # '🦄 '
print(unicorn_hash, unicorn_text)
+infobox
| #[+label-inline API:] #[+api("stringstore") #[code StringStore]]
@ -88,16 +103,19 @@ p
+h(3, "lightning-tour-entities") Recognise and update named entities
+tag-model("NER")
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'San Francisco considers banning sidewalk delivery robots')
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
assert ents == [(u'San Francisco', 0, 13, u'GPE')]
for ent in doc.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)
from spacy.tokens import Span
doc = nlp(u'Netflix is hiring a new VP of global policy')
doc = nlp(u'FB is hiring a new VP of global policy')
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings[u'ORG'])]
ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
assert ents == [(0, 7, u'ORG')]
for ent in doc.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)
+infobox
| #[+label-inline Usage:] #[+a("/usage/linguistic-features#named-entities") Named entity recognition]
@ -181,21 +199,25 @@ p
+h(3, "lightning-tour-word-vectors") Get word vectors and similarity
+tag-model("word vectors")
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp(u"Apple and banana are similar. Pasta and hippo aren't.")
apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]
assert apple.similarity(banana) > pasta.similarity(hippo)
assert apple.has_vector
assert banana.has_vector
assert pasta.has_vector
assert hippo.has_vector
print('apple <-> banana', apple.similarity(banana))
print('pasta <-> hippo', pasta.similarity(hippo))
print(apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector)
p
| For the best results, you should run this example using the
| #[+a("/models/en#en_vectors_web_lg") #[code en_vectors_web_lg]] model.
| #[+a("/models/en#en_vectors_web_lg") #[code en_vectors_web_lg]] model
| (currently not available in the live demo).
+infobox
| #[+label-inline Usage:] #[+a("/usage/vectors-similarity") Word vectors and similarity]
@ -221,11 +243,11 @@ p
+h(3, "lightning-tour-rule-matcher") Match text with token rules
+code.
+code-exec.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
def set_sentiment(matcher, doc, i, matches):
@ -235,8 +257,15 @@ p
pattern2 = [[{'ORTH': emoji, 'OP': '+'}] for emoji in ['😀', '😂', '🤣', '😍']]
matcher.add('GoogleIO', None, pattern1) # match "Google I/O" or "Google i/o"
matcher.add('HAPPY', set_sentiment, *pattern2) # match one or more happy emoji
text = open('customer_feedback_627.txt').read()
matches = nlp(text)
doc = nlp(u"A text about Google I/O 😀😀")
matches = matcher(doc)
for match_id, start, end in matches:
string_id = nlp.vocab.strings[match_id]
span = doc[start:end]
print(string_id, span.text)
print('Sentiment', doc.sentiment)
+infobox
| #[+label-inline API:] #[+api("matcher") #[code Matcher]]
@ -260,14 +289,19 @@ p
+h(3, "lightning-tour-dependencies") Get syntactic dependencies
+tag-model("dependency parse")
+code.
def dependency_labels_to_root(token):
"""Walk up the syntactic tree, collecting the arc labels."""
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"When Sebastian Thrun started working on self-driving cars at Google "
u"in 2007, few people outside of the company took him seriously.")
dep_labels = []
for token in doc:
while token.head != token:
dep_labels.append(token.dep)
dep_labels.append(token.dep_)
token = token.head
return dep_labels
print(dep_labels)
+infobox
| #[+label-inline API:] #[+api("token") #[code Token]]
@ -275,25 +309,38 @@ p
+h(3, "lightning-tour-numpy-arrays") Export to numpy arrays
+code.
from spacy.attrs import ORTH, LIKE_URL, IS_OOV
+code-exec.
import spacy
from spacy.attrs import ORTH, LIKE_URL
attr_ids = [ORTH, LIKE_URL, IS_OOV]
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Check out https://spacy.io")
for token in doc:
print(token.text, token.orth, token.like_url)
attr_ids = [ORTH, LIKE_URL]
doc_array = doc.to_array(attr_ids)
assert doc_array.shape == (len(doc), len(attr_ids))
print(doc_array.shape)
print(len(doc), len(attr_ids))
assert doc[0].orth == doc_array[0, 0]
assert doc[1].orth == doc_array[1, 0]
assert doc[0].like_url == doc_array[0, 1]
assert list(doc_array[:, 1]) == [t.like_url for t in doc]
print(list(doc_array[:, 1]))
+h(3, "lightning-tour-inline") Calculate inline markup on original string
+code.
def put_spans_around_tokens(doc, get_classes):
"""Given some function to compute class names, put each token in a
+code-exec.
import spacy
def put_spans_around_tokens(doc):
"""Here, we're building a custom "syntax highlighter" for
part-of-speech tags and dependencies. We put each token in a
span element, with the appropriate classes computed. All whitespace is
preserved, outside of the spans. (Of course, HTML won't display more than
one whitespace character, but the point is, no information is lost
preserved, outside of the spans. (Of course, HTML will only display
multiple whitespace if enabled, but the point is, no information is lost
and you can calculate what you need, e.g. &lt;br /&gt;, &lt;p&gt; etc.)
"""
output = []
@ -302,9 +349,14 @@ p
if token.is_space:
output.append(token.text)
else:
classes = ' '.join(get_classes(token))
classes = 'pos-{} dep-{}'.format(token.pos_, token.dep_)
output.append(html.format(classes=classes, word=token.text, space=token.whitespace_))
string = ''.join(output)
string = string.replace('\n', '')
string = string.replace('\t', ' ')
return string
return '&lt;pre&gt;{}&lt;/pre&gt;'.format(string)
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a test.\n\nHello world.")
html = put_spans_around_tokens(doc)
print(html)

View File

@ -13,7 +13,10 @@ p
p
| Named entities are available as the #[code ents] property of a #[code Doc]:
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for ent in doc.ents:

View File

@ -17,7 +17,10 @@ p
| representation of an attribute, we need to add an underscore #[code _]
| to its name:
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:

View File

@ -16,12 +16,15 @@ p
| really depends on how you're looking at it. spaCy's similarity model
| usually assumes a pretty general-purpose definition of similarity.
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat banana')
for token1 in tokens:
for token2 in tokens:
print(token1.similarity(token2))
print(token1.text, token2.text, token1.similarity(token2))
+aside
| #[strong #[+procon("neutral", "identical", false, 16)] similarity:] identical#[br]

View File

@ -8,7 +8,11 @@ p
| Each #[code Doc] consists of individual tokens, and we can simply iterate
| over them:
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
print(token.text)

View File

@ -31,10 +31,13 @@ p
| #[strong lookup table that works in both directions] you can look up a
| string to get its hash, or a hash to get its string:
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
assert doc.vocab.strings[u'coffee'] == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u'coffee'
print(doc.vocab.strings[u'coffee']) # 3197928453018144401
print(doc.vocab.strings[3197928453018144401]) # 'coffee'
+aside("What does 'L' at the end of a hash mean?")
| If you return a hash value in the #[strong Python 2 interpreter], it'll
@ -51,7 +54,11 @@ p
| context, its spelling and whether it consists of alphabetic characters
| won't ever change. Its hash value will also always be the same.
+code.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
for word in doc:
lexeme = doc.vocab[word.text]
print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
@ -87,22 +94,24 @@ p
| sure all objects you create have access to the same vocabulary. If they
| don't, spaCy might not be able to find the strings it needs.
+code.
+code-exec.
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab
doc = nlp(u'I like coffee') # original Doc
assert doc.vocab.strings[u'coffee'] == 3197928453018144401 # get hash
assert doc.vocab.strings[3197928453018144401] == u'coffee' # 👍
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee') # original Doc
print(doc.vocab.strings[u'coffee']) # 3197928453018144401
print(doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
empty_doc = Doc(Vocab()) # new Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(
empty_doc.vocab.strings.add(u'coffee') # add "coffee" and generate hash
assert empty_doc.vocab.strings[3197928453018144401] == u'coffee' # 👍
print(empty_doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
new_doc = Doc(doc.vocab) # create new doc with first doc's vocab
assert new_doc.vocab.strings[3197928453018144401] == u'coffee' # 👍
print(new_doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
p
| If the vocabulary doesn't contain a string for #[code 3197928453018144401],

View File

@ -130,9 +130,11 @@ p
| assigned, and get the L2 norm, which can be used to normalise
| vectors.
+code.
nlp = spacy.load('en_core_web_lg')
tokens = nlp(u'dog cat banana sasquatch')
+code-exec.
import spacy
nlp = spacy.load('en_core_web_md')
tokens = nlp(u'dog cat banana afskfsd')
for token in tokens:
print(token.text, token.has_vector, token.vector_norm, token.is_oov)
@ -142,19 +144,12 @@ p
| #[strong has vector]: Does the token have a vector representation?#[br]
| #[strong Vector norm]: The L2 norm of the token's vector (the square root
| of the sum of the values squared)#[br]
| #[strong is OOV]: Is the word out-of-vocabulary?
+table(["Text", "Has vector", "Vector norm", "OOV"])
- var style = [0, 1, 1, 1]
+annotation-row(["dog", true, 7.033672992262838, false], style)
+annotation-row(["cat", true, 6.68081871208896, false], style)
+annotation-row(["banana", true, 6.700014292148571, false], style)
+annotation-row(["sasquatch", false, 0, true], style)
| #[strong OOV]: Out-of-vocabulary
p
| The words "dog", "cat" and "banana" are all pretty common in English, so
| they're part of the model's vocabulary, and come with a vector. The word
| "sasquatch" on the other hand is a lot less common and out-of-vocabulary
| "afskfsd" on the other hand is a lot less common and out-of-vocabulary
| so its vector representation consists of 300 dimensions of #[code 0],
| which means it's practically nonexistent. If your application will
| benefit from a #[strong large vocabulary] with more vectors, you should

View File

@ -2,6 +2,27 @@
include ../_spacy-101/_training
+h(3, "spacy-train-cli") Training via the command-line interface
p
| For most purposes, the best way to train spaCy is via the command-line
| interface. The #[+api("cli#train") #[code spacy train]] command takes
| care of many details for you, including making sure that the data is
| minibatched and shuffled correctly, progress is printed, and models are
| saved after each epoch. You can prepare your data for use in
| #[+api("cli#train") #[code spacy train]] using the
| #[+api("cli#convert") #[code spacy convert]] command, which accepts many
| common NLP data formats, including #[code .iob] for named entities, and
| the CoNLL format for dependencies:
+code("Example", "bash").
git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
+h(3, "training-data") How do I get training data?
p

View File

@ -4,7 +4,7 @@
+tag-new(2)
p
| This example shows how to train a multi-label convolutional neural
| This example shows how to train a convolutional neural
| network text classifier on IMDB movie reviews, using spaCy's new
| #[+api("textcategorizer") #[code TextCategorizer]] component. The
| dataset will be loaded automatically via Thinc's built-in dataset

View File

@ -34,8 +34,7 @@ p
| #[strong under 1 GB of memory] per process.
+infobox
| #[+label-inline Usage:] #[+a("/models") Models directory],
| #[+a("/models/comparison") Models comparison],
| #[+label-inline Usage:] #[+a("/models") Models directory]
| #[+a("#benchmarks") Benchmarks]
+h(3, "features-pipelines") Improved processing pipelines

View File

@ -1,135 +0,0 @@
//- 💫 DOCS > USAGE > VECTORS & SIMILARITY > BASICS
+aside("Training word vectors")
| Dense, real-valued vectors representing distributional similarity
| information are now a cornerstone of practical NLP. The most common way
| to train these vectors is the #[+a("https://en.wikipedia.org/wiki/Word2vec") word2vec]
| family of algorithms. If you need to train a word2vec model, we recommend
| the implementation in the Python library
| #[+a("https://radimrehurek.com/gensim/") Gensim].
include ../_spacy-101/_similarity
include ../_spacy-101/_word-vectors
+h(3, "in-context") Similarities in context
p
| Aside from spaCy's built-in word vectors, which were trained on a lot of
| text with a wide vocabulary, the parsing, tagging and NER models also
| rely on vector representations of the #[strong meanings of words in context].
| As the #[+a("/usage/processing-pipelines") processing pipeline] is
| applied, spaCy encodes a document's internal meaning representations as an
| array of floats, also called a tensor. This allows spaCy to make a
| reasonable guess at a word's meaning, based on its surrounding words.
| Even if a word hasn't been seen before, spaCy will know #[em something]
| about it. Because spaCy uses a 4-layer convolutional network, the
| tensors are sensitive to up to #[strong four words on either side] of a
| word.
p
| For example, here are three sentences containing the out-of-vocabulary
| word "labrador" in different contexts.
+code.
doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"the labrador people live in canada.")
for doc in [doc1, doc2, doc3]:
labrador = doc[1]
dog = nlp(u"dog")
print(labrador.similarity(dog))
p
| Even though the model has never seen the word "labrador", it can make a
| fairly accurate prediction of its similarity to "dog" in different
| contexts.
+table(["Context", "labrador.similarity(dog)"])
+row
+cell The #[strong labrador] barked.
+cell #[code 0.56] #[+procon("yes", "similar")]
+row
+cell The #[strong labrador] swam.
+cell #[code 0.48] #[+procon("no", "dissimilar")]
+row
+cell the #[strong labrador] people live in canada.
+cell #[code 0.39] #[+procon("no", "dissimilar")]
p
| The same also works for whole documents. Here, the variance of the
| similarities is lower, as all words and their order are taken into
| account. However, the context-specific similarity is often still
| reflected pretty accurately.
+code.
doc1 = nlp(u"Paris is the largest city in France.")
doc2 = nlp(u"Vilnius is the capital of Lithuania.")
doc3 = nlp(u"An emu is a large bird.")
for doc in [doc1, doc2, doc3]:
for other_doc in [doc1, doc2, doc3]:
print(doc.similarity(other_doc))
p
| Even though the sentences about Paris and Vilnius consist of different
| words and entities, they both describe the same concept and are seen as
| more similar than the sentence about emus. In this case, even a misspelled
| version of "Vilnius" would still produce very similar results.
+table
- var examples = {"Paris is the largest city in France.": [1, 0.85, 0.65], "Vilnius is the capital of Lithuania.": [0.85, 1, 0.55], "An emu is a large bird.": [0.65, 0.55, 1]}
- var counter = 0
+row
+row
+cell
for _, label in examples
+cell=label
each cells, label in examples
+row(counter ? null : "divider")
+cell=label
for cell in cells
+cell.u-text-center
- var result = cell < 0.7 ? ["no", "dissimilar"] : cell != 1 ? ["yes", "similar"] : ["neutral", "identical"]
| #[code=cell.toFixed(2)] #[+procon(...result)]
- counter++
p
| Sentences that consist of the same words in different order will likely
| be seen as very similar but never identical.
+code.
docs = [nlp(u"dog bites man"), nlp(u"man bites dog"),
nlp(u"man dog bites"), nlp(u"dog man bites")]
for doc in docs:
for other_doc in docs:
print(doc.similarity(other_doc))
p
| Interestingly, "man bites dog" and "man dog bites" are seen as slightly
| more similar than "man bites dog" and "dog bites man". This may be a
| coincidence or the result of "man" being interpreted as both sentences'
| subject.
+table
- var examples = {"dog bites man": [1, 0.9, 0.89, 0.92], "man bites dog": [0.9, 1, 0.93, 0.9], "man dog bites": [0.89, 0.93, 1, 0.92], "dog man bites": [0.92, 0.9, 0.92, 1]}
- var counter = 0
+row("head")
+cell
for _, label in examples
+cell.u-text-center=label
each cells, label in examples
+row(counter ? null : "divider")
+cell=label
for cell in cells
+cell.u-text-center
- var result = cell < 0.7 ? ["no", "dissimilar"] : cell != 1 ? ["yes", "similar"] : ["neutral", "identical"]
| #[code=cell.toFixed(2)] #[+procon(...result)]
- counter++

View File

@ -36,6 +36,44 @@ p
| vector table produces rapidly diminishing returns in coverage over these
| rare terms.
+h(3, "converting") Converting word vectors for use in spaCy
+tag-new("2.0.10")
p
| Custom word vectors can be trained using a number of open-source libraries,
| such as #[+a("https://radimrehurek.com/gensim") Gensim],
| #[+a("https://fasttext.cc") Fast Text], or Tomas Mikolov's original
| #[+a("https://code.google.com/archive/p/word2vec/") word2vec implementation].
| Most word vector libraries output an easy-to-read text-based format, where
| each line consists of the word followed by its vector. For everyday use,
| we want to convert the vectors model into a binary format that loads faster
| and takes up less space on disk. The easiest way to do this is the
| #[+api("cli#init-model") #[code init-model]] command-line utility:
+code(false, "bash").
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
python -m spacy init-model /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
p
| This will output a spaCy model in the directory
| #[code /tmp/la_vectors_wiki_lg], giving you access to some nice Latin
| vectors 😉 You can then pass the directory path to
| #[+api("spacy#load") #[code spacy.load()]].
+code.
nlp_latin = spacy.load('/tmp/la_vectors_wiki_lg')
doc1 = nlp(u"Caecilius est in horto")
doc2 = nlp(u"servus est in atrio")
doc1.similarity(doc2)
p
| The model directory will have a #[code /vocab] directory with the strings,
| lexical entries and word vectors from the input vectors model. The
| #[+api("cli#init-model") #[code init-model]] command supports a number of
| archive formats for the word vectors: the vectors can be in plain text
| (#[code .txt]), zipped (#[code .zip]), or tarred and zipped
| (#[code .tgz]).
+h(3, "custom-vectors-coverage") Optimising vector coverage
+tag-new(2)
@ -98,6 +136,19 @@ p
| to the vector of "coast", which is deemed about 73% similar. "Leaving"
| was remapped to the vector of "leaving", which is identical.
p
| If you're using the #[+api("cli#init-model") #[code init-model]] command,
| you can set the #[code --prune-vectors] option to easily reduce the size
| of the vectors as you add them to a spaCy model:
+code(false, "bash", "$").
python -m spacy init-model /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000
p
| This will create a spaCy model with vectors for the first 10,000 words in
| the vectors model. All other words in the vectors model are mapped to the
| closest vector among those retained.
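p
| One way to see the effect of pruning is to compare the number of rows in
| the vector table with the number of keys pointing into it. The numbers in
| the comments below are illustrative and depend on your vectors:
+code("Inspecting a pruned vectors model (sketch)").
    import spacy

    # path to the pruned model created above
    nlp = spacy.load('/tmp/la_vectors_web_md')
    print(nlp.vocab.vectors.shape)           # e.g. (10000, 300): 10,000 rows
    print(len(nlp.vocab.vectors.key2row))    # many more keys than rows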
+h(3, "custom-vectors-add") Adding vectors
+tag-new(2)
@ -172,18 +223,6 @@ p
| package the model using the #[+api("cli#package") #[code package]]
| command.
+h(3, "custom-loading-other") Loading other vectors
+tag-new(2)
p
| You can also choose to load in vectors from other sources, like the
| #[+a("https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md") fastText vectors]
| for 294 languages, trained on Wikipedia. After reading in the file,
| the vectors are added to the #[code Vocab] using the
| #[+api("vocab#set_vector") #[code set_vector]] method.
+github("spacy", "examples/vectors_fast_text.py")
+h(3, "custom-similarity") Using custom similarity methods
p

View File

@ -150,17 +150,6 @@ include ../_includes/_mixins
+github("spacy", "examples/training/train_textcat.py")
+section("vectors")
+h(3, "fasttext") Loading pre-trained fastText vectors
p
| This simple snippet is all you need to be able to use the Facebook's
| #[+a("https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md") fastText vectors]
| (294 languages, pre-trained on Wikipedia) with spaCy. Once they're
| loaded, the vectors will be available via spaCy's built-in
| #[code similarity()] methods.
+github("spacy", "examples/vectors_fast_text.py")
+h(3, "tensorboard") Visualizing spaCy vectors in TensorBoard
p

View File

@ -12,8 +12,9 @@ include _spacy-101/_pipelines
+h(2, "custom-components") Creating custom pipeline components
include _processing-pipelines/_custom-components
+section("extensions")
+h(2, "extensions") Developing spaCy extensions
+section("custom-components-extensions")
+h(2, "custom-components-extensions") Extension attributes
+tag-new(2)
include _processing-pipelines/_extensions
+section("multithreading")

View File

@ -1,189 +0,0 @@
//- 💫 DOCS > USAGE > RESOURCES
include ../_includes/_mixins
+aside("Contribute to this page")
| Have you built something cool with spaCy or come across a paper, book or
| course that should be featured here?
| #[a(href="mailto:#{EMAIL}") Let us know!]
+section("libraries")
+h(2, "libraries") Third-party libraries
+grid
+card("neuralcoref", "https://github.com/huggingface/neuralcoref", "Hugging Face", "github")
| State-of-the-art coreference resolution based on neural nets
| and spaCy
+card("rasa_nlu", "https://github.com/golastmile/rasa_nlu", "LastMile", "github")
| High level APIs for building your own language parser using
| existing NLP and ML libraries.
+card("textacy", "https://github.com/chartbeat-labs/textacy", "Burton DeWilde", "github")
| Higher-level NLP built on spaCy.
+card("mordecai", "https://github.com/openeventdata/mordecai", "Andy Halterman", "github")
| Full text geoparsing using spaCy, Geonames and Keras.
+card("kindred", "https://github.com/jakelever/kindred", "Jake Lever", "github")
| Biomedical relation extraction using spaCy.
+card("spacyr", "https://github.com/kbenoit/spacyr", "Kenneth Benoit", "github")
| An R wrapper for spaCy.
+card("spacy_api", "https://github.com/kootenpv/spacy_api", "Pascal van Kooten", "github")
| Server/client to load models in a separate, dedicated process.
+card("spacy-api-docker", "https://github.com/jgontrum/spacy-api-docker", "Johannes Gontrum", "github")
| spaCy accessed by a REST API, wrapped in a Docker container.
+card("languagecrunch", "https://github.com/artpar/languagecrunch", "Parth Mudgal", "github")
| NLP server for spaCy, WordNet and NeuralCoref as a Docker image.
+card("spacy-nlp-zeromq", "https://github.com/pasupulaphani/spacy-nlp-docker", "Phaninder Pasupula", "github")
| Docker image exposing spaCy with ZeroMQ bindings.
+card("spacy-nlp", "https://github.com/kengz/spacy-nlp", "Wah Loon Keng", "github")
| Expose spaCy NLP text parsing to Node.js (and other languages)
| via Socket.IO.
.u-text-right
+button("https://github.com/search?o=desc&q=spacy&s=stars&type=Repositories&utf8=%E2%9C%93", false, "primary", "small") See more projects on GitHub
+section("extensions")
+h(2, "extensions") Extensions & Pipeline Components
p
| This section lists spaCy extensions and components you can plug into
| your processing pipeline. For more details, see the docs on
| #[+a("/usage/processing-pipelines#custom-components") custom components]
| and #[+a("/usage/processing-pipelines#extensions") extensions].
+grid
+card("spacymoji", "https://github.com/ines/spacymoji", "Ines Montani", "github")
| Pipeline component for emoji handling and adding emoji meta data
| to #[code Doc], #[code Token] and #[code Span] attributes.
+card("spacy_hunspell", "https://github.com/tokestermw/spacy_hunspell", "Motoki Wu", "github")
| Add spellchecking and spelling suggestions to your spaCy pipeline
| using Hunspell.
+card("spacy_cld", "https://github.com/nickdavidhaynes/spacy-cld", "Nicholas D Haynes", "github")
| Add language detection to your spaCy pipeline using Compact
| Language Detector 2 via PYCLD2.
+card("spacy-lookup", "https://github.com/mpuig/spacy-lookup", "Marc Puig", "github")
| A powerful entity matcher for very large dictionaries, using the
| FlashText module.
+card("spacy-iwnlp", "https://github.com/Liebeck/spacy-iwnlp", "Matthias Liebeck", "github")
| German lemmatization with IWNLP.
+card("spacy-sentiws", "https://github.com/Liebeck/spacy-sentiws", "Matthias Liebeck", "github")
| German sentiment scores with SentiWS.
+card("spacy-lefff", "https://github.com/sammous/spacy-lefff", "Sami Moustachir", "github")
| French lemmatization with Lefff.
.u-text-right
+button("https://github.com/topics/spacy-extension?o=desc&s=stars", false, "primary", "small") See more extensions on GitHub
+section("demos")
+h(2, "demos") Demos & Visualizations
+grid
+card("Neural coref", "https://huggingface.co/coref/", "Hugging Face")
+image("/assets/img/resources/neuralcoref.jpg").o-block-small
| State-of-the-art coreference resolution based on neural nets
| and spaCy.
+card("sense2vec", "https://demos.explosion.ai/sense2vec", "Matthew Honnibal and Ines Montani")
+image("/assets/img/resources/sense2vec.jpg").o-block-small
| Semantic analysis of the Reddit hivemind using sense2vec and spaCy.
+card("displaCy", "https://demos.explosion.ai/displacy", "Ines Montani")
+image("/assets/img/resources/displacy.jpg").o-block-small
| An open-source NLP visualiser for the modern web.
+card("displaCy ENT", "https://demos.explosion.ai/displacy-ent", "Ines Montani")
+image("/assets/img/resources/displacy-ent.jpg").o-block-small
| An open-source named entity visualiser for the modern web.
+card("spacy-vis", "http://spacyvis.allennlp.org/spacy-parser", "Mark Neumann")
+image("/assets/img/resources/spacy-vis.jpg").o-block-small
| Visualise spaCy's dependency parses, with part-of-speech tags and
| entities added to node attributes.
+section("books")
+h(2, "books") Books & Courses
+grid
+card("Natural Language Processing Fundamentals in Python", "https://www.datacamp.com/courses/natural-language-processing-fundamentals-in-python", "Katharine Jarmul (Datacamp, 2017)", "course")
| An interactive online course on everything you need to know about
| Natural Language Processing in Python, featuring spaCy and NLTK.
+card("Learning Path: Mastering SpaCy for Natural Language Processing", "https://www.safaribooksonline.com/library/view/learning-path-mastering/9781491986653/", "Aaron Kramer (O'Reilly, 2017)", "course")
| A hands-on introduction to using spaCy to discover insights
| through Natural Language Processing.
+card("Introduction to Machine Learning with Python: A Guide for Data Scientists", "https://books.google.com/books?id=vbQlDQAAQBAJ", "Andreas C. Müller and Sarah Guido (O'Reilly, 2016)", "book")
| Andreas is a lead developer of Scikit-Learn, and Sarah is a lead
| data scientist at Mashable. We're proud to get a mention.
+card("Text Analytics with Python", "https://www.amazon.com/Text-Analytics-Python-Real-World-Actionable/dp/148422387X", "Dipanjan Sarkar (Apress / Springer, 2016)", "book")
| A Practical Real-World Approach to Gaining Actionable Insights
| from your Data
+card("Practical Machine Learning with Python", "", "Dipanjan Sarkar et al. (Apress, 2017)", "book")
| A Problem-Solver's Guide to Building Real-World Intelligent Systems
+section("notebooks")
+h(2, "notebooks") Jupyter notebooks
+grid
+card("Modern NLP in Python", gh("spacy-notebooks", "notebooks/conference_notebooks/modern_nlp_in_python.ipynb"), "Patrick Harrison", "jupyter")
| Introduction to NLP in Python using spaCy and Gensim. Presented
| at PyData DC 2016.
+card("Advanced Text Analysis", gh("spacy-notebooks", "notebooks/conference_notebooks/advanced_text_analysis.ipynb"), "Jonathan Reeve", "jupyter")
| Advanced Text Analysis with spaCy and Scikit-Learn. Presented at
| NYU during NYCDH Week 2017.
.u-text-right
+button(gh("spacy-notebooks"), false, "primary", "small") See more notebooks on GitHub
+section("videos")
+h(2, "videos") Videos
+youtube("sqDHBH9IjRU")
+section("research")
+h(2, "research") Research systems
p Researchers are using spaCy to build ambitious, next-generation text processing technologies. spaCy is particularly popular amongst the biomedical NLP community, who are working on extracting knowledge from the huge volume of literature in their field.
+grid
+card(false, "https://www.semanticscholar.org/paper/Choosing-an-NLP-Library-for-Analyzing-Software-Doc-Omran-Treude/72f280e47e91b30af24205fa24d53247605aa591", "Fouad Nasser A. Al Omran et al. (2017)", "book", "third")
| Choosing an NLP Library for Analyzing Software Documentation: A
| Systematic Literature Review and a Series of Experiments
+card(false, "https://www.semanticscholar.org/paper/Mixing-Dirichlet-Topic-Models-and-Word-Embeddings-Moody/bf8116e06f7b498c6abfbf97aeb67d0838c08609", "Christopher E. Moody (2016)", "book", "third")
| Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec
+card(false, "https://www.semanticscholar.org/paper/Refactoring-the-Genia-Event-Extraction-Shared-Task-Kim-Wang/06d94b64a7bd2d3433f57caddad5084435d6a91f", "Jin-Dong Kim et al. (2016)", "book", "third")
| Refactoring the Genia Event Extraction Shared Task Toward a
| General Framework for IE-Driven KB Development
+card(false, "https://www.semanticscholar.org/paper/Predicting-Pre-click-Quality-for-Native-Zhou-Redi/564985430ff2fbc3a9daa9c2af8997b7f5046da8", "Ke Zhou et al. (2016)", "book", "third")
| Predicting Pre-click Quality for Native Advertisements
+card(false, "https://www.semanticscholar.org/paper/Threat-detection-in-online-discussions-Wester-%C3%98vrelid/f4150e2fb4d8646ebc2ea84f1a86afa1b593239b", "Aksel Wester et al. (2016)", "book", "third")
| Threat detection in online discussions
+card(false, "https://www.semanticscholar.org/paper/Distributional-semantics-for-understanding-spoken-Korpusik-Huang/5f55c5535e80d3e5ed7f1f0b89531e32725faff5", "Mandy Korpusik et al. (2016)", "book", "third")
| Distributional semantics for understanding spoken meal
| descriptions
.u-text-right
+button("https://scholar.google.com/scholar?scisbd=2&q=spacy&hl=en&as_sdt=1,5&as_vis=1", false, "primary", "small")
| See 200+ papers on Google Scholar

View File

@ -189,9 +189,9 @@ p
| the website or company in a specific context.
+aside-code("Loading models", "bash", "$").
python -m spacy download en
python -m spacy download en_core_web_sm
&gt;&gt;&gt; import spacy
&gt;&gt;&gt; nlp = spacy.load('en')
&gt;&gt;&gt; nlp = spacy.load('en_core_web_sm')
p
| Once you've #[+a("/usage/models") downloaded and installed] a model,
@ -200,11 +200,13 @@ p
| to process text. We usually call it #[code nlp]. Calling the #[code nlp]
| object on a string of text will return a processed #[code Doc]:
+code.
+code-exec.
import spacy
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
print(token.text, token.pos_, token.dep_)
p
| Even though a #[code Doc] is processed e.g. split into individual words

View File

@ -3,7 +3,16 @@
include ../_includes/_mixins
+section("basics")
include _vectors-similarity/_basics
+aside("Training word vectors")
| Dense, real-valued vectors representing distributional similarity
| information are now a cornerstone of practical NLP. The most common way
| to train these vectors is the #[+a("https://en.wikipedia.org/wiki/Word2vec") word2vec]
| family of algorithms. If you need to train a word2vec model, we recommend
| the implementation in the Python library
| #[+a("https://radimrehurek.com/gensim/") Gensim].
include _spacy-101/_similarity
include _spacy-101/_word-vectors
+section("custom")
+h(2, "custom") Customising word vectors