Merge pull request #1461 from explosion/feature/disable-pipes

💫 Add Language.disable_pipes(), to temporarily edit pipeline and update code examples
This commit is contained in:
Ines Montani 2017-10-27 12:21:40 +02:00 committed by GitHub
commit 4033e70c71
42 changed files with 1570 additions and 1396 deletions

View File

@ -2,20 +2,18 @@
# spaCy examples
The examples are Python scripts with well-behaved command line interfaces. For a full list of spaCy tutorials and code snippets, see the [documentation](https://spacy.io/docs/usage/tutorials).
The examples are Python scripts with well-behaved command line interfaces. For
more detailed usage guides, see the [documentation](https://alpha.spacy.io/usage/).
## How to run an example
For example, to run the [`nn_text_class.py`](nn_text_class.py) script, do:
To see the available arguments, you can use the `--help` or `-h` flag:
```bash
$ python examples/nn_text_class.py
usage: nn_text_class.py [-h] [-d 3] [-H 300] [-i 5] [-w 40000] [-b 24]
[-r 0.3] [-p 1e-05] [-e 0.005]
data_dir
nn_text_class.py: error: too few arguments
$ python examples/training/train_ner.py --help
```
You can print detailed help with the `-h` argument.
While we try to keep the examples up to date, they are not currently exercised by the test suite, as some of them require significant data downloads or take time to train. If you find that an example is no longer running, [please tell us](https://github.com/explosion/spaCy/issues)! We know there's nothing worse than trying to figure out what you're doing wrong, and it turns out your code was never the problem.
While we try to keep the examples up to date, they are not currently exercised
by the test suite, as some of them require significant data downloads or take
time to train. If you find that an example is no longer running,
[please tell us](https://github.com/explosion/spaCy/issues)! We know there's
nothing worse than trying to figure out what you're doing wrong, and it turns
out your code was never the problem.

View File

@ -1,37 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from math import sqrt
from numpy import dot
from numpy.linalg import norm
def handle_tweet(spacy, tweet_data, query):
text = tweet_data.get('text', u'')
# Twython returns either bytes or unicode, depending on tweet.
# ಠ_ಠ #APIshaming
try:
match_tweet(spacy, text, query)
except TypeError:
match_tweet(spacy, text.decode('utf8'), query)
def match_tweet(spacy, text, query):
def get_vector(word):
return spacy.vocab[word].repvec
tweet = spacy(text)
tweet = [w.repvec for w in tweet if w.is_alpha and w.lower_ != query]
if tweet:
accept = map(get_vector, 'child classroom teach'.split())
reject = map(get_vector, 'mouth hands giveaway'.split())
y = sum(max(cos(w1, w2), 0) for w1 in tweet for w2 in accept)
n = sum(max(cos(w1, w2), 0) for w1 in tweet for w2 in reject)
if (y / (y + n)) >= 0.5 or True:
print(text)
def cos(v1, v2):
return dot(v1, v2) / (norm(v1) * norm(v2))

View File

@ -1,59 +0,0 @@
"""Issue #252
Question:
In the documents and tutorials the main thing I haven't found is examples on how to break sentences down into small sub thoughts/chunks. The noun_chunks is handy, but having examples on using the token.head to find small (near-complete) sentence chunks would be neat.
Lets take the example sentence on https://displacy.spacy.io/displacy/index.html
displaCy uses CSS and JavaScript to show you how computers understand language
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and Javascript [to + show]
&
show you how computers understand [language]
I'm assuming that we can use the token.head to build these groups. In one of your examples you had the following function.
def dependency_labels_to_root(token):
'''Walk up the syntactic tree, collecting the arc labels.'''
dep_labels = []
while token.head is not token:
dep_labels.append(token.dep)
token = token.head
return dep_labels
"""
from __future__ import print_function, unicode_literals
# Answer:
# The easiest way is to find the head of the subtree you want, and then use the
# `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree` is the
# one that does what you're asking for most directly:
from spacy.en import English
nlp = English()
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
print(''.join(w.text_with_ws for w in word.subtree))
# It'd probably be better for `word.subtree` to return a `Span` object instead
# of a generator over the tokens. If you want the `Span` you can get it via the
# `.right_edge` and `.left_edge` properties. The `Span` object is nice because
# you can easily get a vector, merge it, etc.
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
print(subtree_span.text, '|', subtree_span.root.text)
print(subtree_span.similarity(doc))
print(subtree_span.similarity(subtree_span.root))
# You might also want to select a head, and then select a start and end position by
# walking along its children. You could then take the `.left_edge` and `.right_edge`
# of those tokens, and use it to calculate a span.

View File

@ -1,59 +0,0 @@
import plac
from spacy.en import English
from spacy.parts_of_speech import NOUN
from spacy.parts_of_speech import ADP as PREP
def _span_to_tuple(span):
start = span[0].idx
end = span[-1].idx + len(span[-1])
tag = span.root.tag_
text = span.text
label = span.label_
return (start, end, tag, text, label)
def merge_spans(spans, doc):
# This is a bit awkward atm. What we're doing here is merging the entities,
# so that each only takes up a single token. But an entity is a Span, and
# each Span is a view into the doc. When we merge a span, we invalidate
# the other spans. This will get fixed --- but for now the solution
# is to gather the information first, before merging.
tuples = [_span_to_tuple(span) for span in spans]
for span_tuple in tuples:
doc.merge(*span_tuple)
def extract_currency_relations(doc):
merge_spans(doc.ents, doc)
merge_spans(doc.noun_chunks, doc)
relations = []
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
if money.dep_ in ('attr', 'dobj'):
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
if subject:
subject = subject[0]
relations.append((subject, money))
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
relations.append((money.head.head, money))
return relations
def main():
nlp = English()
texts = [
u'Net income was $9.4 million compared to the prior year of $2.7 million.',
u'Revenue exceeded twelve billion dollars, with a loss of $1b.',
]
for text in texts:
doc = nlp(text)
relations = extract_currency_relations(doc)
for r1, r2 in relations:
print(r1.text, r2.ent_type_, r2.text)
if __name__ == '__main__':
plac.call(main)

View File

@ -0,0 +1,62 @@
#!/usr/bin/env python
# coding: utf8
"""
A simple example of extracting relations between phrases and entities using
spaCy's named entity recognizer and the dependency parse. Here, we extract
money and currency values (entities labelled as MONEY) and then check the
dependency tree to find the noun phrase they are referring to for example:
$9.4 million --> Net income.
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import spacy
TEXTS = [
'Net income was $9.4 million compared to the prior year of $2.7 million.',
'Revenue exceeded twelve billion dollars, with a loss of $1b.',
]
@plac.annotations(
model=("Model to load (needs parser and NER)", "positional", None, str))
def main(model='en_core_web_sm'):
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
print("Processing %d texts" % len(TEXTS))
for text in TEXTS:
doc = nlp(text)
relations = extract_currency_relations(doc)
for r1, r2 in relations:
print('{:<10}\t{}\t{}'.format(r1.text, r2.ent_type_, r2.text))
def extract_currency_relations(doc):
# merge entities and noun chunks into one token
for span in [*list(doc.ents), *list(doc.noun_chunks)]:
span.merge()
relations = []
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
if money.dep_ in ('attr', 'dobj'):
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
if subject:
subject = subject[0]
relations.append((subject, money))
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
relations.append((money.head.head, money))
return relations
if __name__ == '__main__':
plac.call(main)
# Expected output:
# Net income MONEY $9.4 million
# the prior year MONEY $2.7 million
# Revenue MONEY twelve billion dollars
# a loss MONEY 1b

View File

@ -0,0 +1,65 @@
#!/usr/bin/env python
# coding: utf8
"""
This example shows how to navigate the parse tree including subtrees attached
to a word.
Based on issue #252:
"In the documents and tutorials the main thing I haven't found is
examples on how to break sentences down into small sub thoughts/chunks. The
noun_chunks is handy, but having examples on using the token.head to find small
(near-complete) sentence chunks would be neat. Lets take the example sentence:
"displaCy uses CSS and JavaScript to show you how computers understand language"
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and Javascript [to + show]
show you how computers understand [language]
I'm assuming that we can use the token.head to build these groups."
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import spacy
@plac.annotations(
model=("Model to load", "positional", None, str))
def main(model='en_core_web_sm'):
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
doc = nlp("displaCy uses CSS and JavaScript to show you how computers "
"understand language")
# The easiest way is to find the head of the subtree you want, and then use
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
# is the one that does what you're asking for most directly:
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
print(''.join(w.text_with_ws for w in word.subtree))
# It'd probably be better for `word.subtree` to return a `Span` object
# instead of a generator over the tokens. If you want the `Span` you can
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
# object is nice because you can easily get a vector, merge it, etc.
for word in doc:
if word.dep_ in ('xcomp', 'ccomp'):
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
print(subtree_span.text, '|', subtree_span.root.text)
# You might also want to select a head, and then select a start and end
# position by walking along its children. You could then take the
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
# a span.
if __name__ == '__main__':
plac.call(main)
# Expected output:
# to show you how computers understand language
# how computers understand language
# to show you how computers understand language | show
# how computers understand language | understand

View File

@ -4,22 +4,24 @@ The idea is to associate each word in the vocabulary with a tag, noting whether
they begin, end, or are inside at least one pattern. An additional tag is used
for single-word patterns. Complete patterns are also stored in a hash set.
When we process a document, we look up the words in the vocabulary, to associate
the words with the tags. We then search for tag-sequences that correspond to
valid candidates. Finally, we look up the candidates in the hash set.
When we process a document, we look up the words in the vocabulary, to
associate the words with the tags. We then search for tag-sequences that
correspond to valid candidates. Finally, we look up the candidates in the hash
set.
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary Clinton", we
would associate "Barack" and "Hilary" with the B tag, Hussein with the I tag,
and Obama and Clinton with the L tag.
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary
Clinton", we would associate "Barack" and "Hilary" with the B tag, Hussein with
the I tag, and Obama and Clinton with the L tag.
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second candidate
is in the phrase dictionary, so only one is returned as a match.
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second
candidate is in the phrase dictionary, so only one is returned as a match.
The algorithm is O(n) at run-time for document of length n because we're only ever
matching over the tag patterns. So no matter how many phrases we're looking for,
our pattern set stays very small (exact size depends on the maximum length we're
looking for, as the query language currently has no quantifiers)
The algorithm is O(n) at run-time for document of length n because we're only
ever matching over the tag patterns. So no matter how many phrases we're
looking for, our pattern set stays very small (exact size depends on the
maximum length we're looking for, as the query language currently has no
quantifiers).
The example expects a .bz2 file from the Reddit corpus, and a patterns file,
formatted in jsonl as a sequence of entries like this:
@ -32,11 +34,9 @@ formatted in jsonl as a sequence of entries like this:
{"text":"Argentina"}
"""
from __future__ import print_function, unicode_literals, division
from bz2 import BZ2File
import time
import math
import codecs
import plac
import ujson
@ -44,6 +44,24 @@ from spacy.matcher import PhraseMatcher
import spacy
@plac.annotations(
patterns_loc=("Path to gazetteer", "positional", None, str),
text_loc=("Path to Reddit corpus file", "positional", None, str),
n=("Number of texts to read", "option", "n", int),
lang=("Language class to initialise", "option", "l", str))
def main(patterns_loc, text_loc, n=10000, lang='en'):
nlp = spacy.blank('en')
nlp.vocab.lex_attr_getters = {}
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
count = 0
t1 = time.time()
for ent_id, text in get_matches(nlp.tokenizer, phrases,
read_text(text_loc, n=n)):
count += 1
t2 = time.time()
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
def read_gazetteer(tokenizer, loc, n=-1):
for i, line in enumerate(open(loc)):
data = ujson.loads(line.strip())
@ -75,18 +93,6 @@ def get_matches(tokenizer, phrases, texts, max_length=6):
yield (ent_id, doc[start:end].text)
def main(patterns_loc, text_loc, n=10000):
nlp = spacy.blank('en')
nlp.vocab.lex_attr_getters = {}
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
count = 0
t1 = time.time()
for ent_id, text in get_matches(nlp.tokenizer, phrases, read_text(text_loc, n=n)):
count += 1
t2 = time.time()
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
if __name__ == '__main__':
if False:
import cProfile

View File

@ -1,5 +0,0 @@
An example of inventory counting using SpaCy.io NLP library. Meant to show how to instantiate Spacy's English class, and allow reusability by reloading the main module.
In the future, a better implementation of this library would be to apply machine learning to each query and learn what to classify as the quantitative statement (55 kgs OF), vs the actual item of count (how likely is a preposition object to be the item of count if x,y,z qualifications appear in the statement).

View File

@ -1,35 +0,0 @@
class Inventory:
"""
Inventory class - a struct{} like feature to house inventory counts
across modules.
"""
originalQuery = None
item = ""
unit = ""
amount = ""
def __init__(self, statement):
"""
Constructor - only takes in the original query/statement
:return: new Inventory object
"""
self.originalQuery = statement
pass
def __str__(self):
return str(self.amount) + ' ' + str(self.unit) + ' ' + str(self.item)
def printInfo(self):
print '-------------Inventory Count------------'
print "Original Query: " + str(self.originalQuery)
print 'Amount: ' + str(self.amount)
print 'Unit: ' + str(self.unit)
print 'Item: ' + str(self.item)
print '----------------------------------------'
def isValid(self):
if not self.item or not self.unit or not self.amount or not self.originalQuery:
return False
else:
return True

View File

@ -1,92 +0,0 @@
from inventory import Inventory
def runTest(nlp):
testset = []
testset += [nlp(u'6 lobster cakes')]
testset += [nlp(u'6 avacados')]
testset += [nlp(u'fifty five carrots')]
testset += [nlp(u'i have 55 carrots')]
testset += [nlp(u'i got me some 9 cabbages')]
testset += [nlp(u'i got 65 kgs of carrots')]
result = []
for doc in testset:
c = decodeInventoryEntry_level1(doc)
if not c.isValid():
c = decodeInventoryEntry_level2(doc)
result.append(c)
for i in result:
i.printInfo()
def decodeInventoryEntry_level1(document):
"""
Decodes a basic entry such as: '6 lobster cake' or '6' cakes
@param document : NLP Doc object
:return: Status if decoded correctly (true, false), and Inventory object
"""
count = Inventory(str(document))
for token in document:
if token.pos_ == (u'NOUN' or u'NNS' or u'NN'):
item = str(token)
for child in token.children:
if child.dep_ == u'compound' or child.dep_ == u'ad':
item = str(child) + str(item)
elif child.dep_ == u'nummod':
count.amount = str(child).strip()
for numerical_child in child.children:
# this isn't arithmetic rather than treating it such as a string
count.amount = str(numerical_child) + str(count.amount).strip()
else:
print "WARNING: unknown child: " + str(child) + ':'+str(child.dep_)
count.item = item
count.unit = item
return count
def decodeInventoryEntry_level2(document):
"""
Entry level 2, a more complicated parsing scheme that covers examples such as
'i have 80 boxes of freshly baked pies'
@document @param document : NLP Doc object
:return: Status if decoded correctly (true, false), and Inventory object-
"""
count = Inventory(str(document))
for token in document:
# Look for a preposition object that is a noun (this is the item we are counting).
# If found, look at its' dependency (if a preposition that is not indicative of
# inventory location, the dependency of the preposition must be a noun
if token.dep_ == (u'pobj' or u'meta') and token.pos_ == (u'NOUN' or u'NNS' or u'NN'):
item = ''
# Go through all the token's children, these are possible adjectives and other add-ons
# this deals with cases such as 'hollow rounded waffle pancakes"
for i in token.children:
item += ' ' + str(i)
item += ' ' + str(token)
count.item = item
# Get the head of the item:
if token.head.dep_ != u'prep':
# Break out of the loop, this is a confusing entry
break
else:
amountUnit = token.head.head
count.unit = str(amountUnit)
for inner in amountUnit.children:
if inner.pos_ == u'NUM':
count.amount += str(inner)
return count

View File

@ -1,30 +0,0 @@
import inventoryCount as mainModule
import os
from spacy.en import English
if __name__ == '__main__':
"""
Main module for this example - loads the English main NLP class,
and keeps it in RAM while waiting for the user to re-run it. Allows the
developer to re-edit their module under testing without having
to wait as long to load the English class
"""
# Set the NLP object here for the parameters you want to see,
# or just leave it blank and get all the opts
print "Loading English module... this will take a while."
nlp = English()
print "Done loading English module."
while True:
try:
reload(mainModule)
mainModule.runTest(nlp)
raw_input('================ To reload main module, press Enter ================')
except Exception, e:
print "Unexpected error: " + str(e)
continue

View File

@ -1,161 +0,0 @@
from __future__ import unicode_literals, print_function
import spacy.en
import spacy.matcher
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
import plac
def main():
nlp = spacy.en.English()
example = u"I prefer Siri to Google Now. I'll google now to find out how the google now service works."
before = nlp(example)
print("Before")
for ent in before.ents:
print(ent.text, ent.label_, [w.tag_ for w in ent])
# Output:
# Google ORG [u'NNP']
# google ORG [u'VB']
# google ORG [u'NNP']
nlp.matcher.add(
"GoogleNow", # Entity ID: Not really used at the moment.
"PRODUCT", # Entity type: should be one of the types in the NER data
{"wiki_en": "Google_Now"}, # Arbitrary attributes. Currently unused.
[ # List of patterns that can be Surface Forms of the entity
# This Surface Form matches "Google Now", verbatim
[ # Each Surface Form is a list of Token Specifiers.
{ # This Token Specifier matches tokens whose orth field is "Google"
ORTH: "Google"
},
{ # This Token Specifier matches tokens whose orth field is "Now"
ORTH: "Now"
}
],
[ # This Surface Form matches "google now", verbatim, and requires
# "google" to have the NNP tag. This helps prevent the pattern from
# matching cases like "I will google now to look up the time"
{
ORTH: "google",
TAG: "NNP"
},
{
ORTH: "now"
}
]
]
)
after = nlp(example)
print("After")
for ent in after.ents:
print(ent.text, ent.label_, [w.tag_ for w in ent])
# Output
# Google Now PRODUCT [u'NNP', u'RB']
# google ORG [u'VB']
# google now PRODUCT [u'NNP', u'RB']
#
# You can customize attribute values in the lexicon, and then refer to the
# new attributes in your Token Specifiers.
# This is particularly good for word-set membership.
#
australian_capitals = ['Brisbane', 'Sydney', 'Canberra', 'Melbourne', 'Hobart',
'Darwin', 'Adelaide', 'Perth']
# Internally, the tokenizer immediately maps each token to a pointer to a
# LexemeC struct. These structs hold various features, e.g. the integer IDs
# of the normalized string forms.
# For our purposes, the key attribute is a 64-bit integer, used as a bit field.
# spaCy currently only uses 12 of the bits for its built-in features, so
# the others are available for use. It's best to use the higher bits, as
# future versions of spaCy may add more flags. For instance, we might add
# a built-in IS_MONTH flag, taking up FLAG13. So, we bind our user-field to
# FLAG63 here.
is_australian_capital = FLAG63
# Now we need to set the flag value. It's False on all tokens by default,
# so we just need to set it to True for the tokens we want.
# Here we iterate over the strings, and set it on only the literal matches.
for string in australian_capitals:
lexeme = nlp.vocab[string]
lexeme.set_flag(is_australian_capital, True)
print('Sydney', nlp.vocab[u'Sydney'].check_flag(is_australian_capital))
print('sydney', nlp.vocab[u'sydney'].check_flag(is_australian_capital))
# If we want case-insensitive matching, we have to be a little bit more
# round-about, as there's no case-insensitive index to the vocabulary. So
# we have to iterate over the vocabulary.
# We'll be looking up attribute IDs in this set a lot, so it's good to pre-build it
target_ids = {nlp.vocab.strings[s.lower()] for s in australian_capitals}
for lexeme in nlp.vocab:
if lexeme.lower in target_ids:
lexeme.set_flag(is_australian_capital, True)
print('Sydney', nlp.vocab[u'Sydney'].check_flag(is_australian_capital))
print('sydney', nlp.vocab[u'sydney'].check_flag(is_australian_capital))
print('SYDNEY', nlp.vocab[u'SYDNEY'].check_flag(is_australian_capital))
# Output
# Sydney True
# sydney False
# Sydney True
# sydney True
# SYDNEY True
#
# The key thing to note here is that we're setting these attributes once,
# over the vocabulary --- and then reusing them at run-time. This means the
# amortized complexity of anything we do this way is going to be O(1). You
# can match over expressions that need to have sets with tens of thousands
# of values, e.g. "all the street names in Germany", and you'll still have
# O(1) complexity. Most regular expression algorithms don't scale well to
# this sort of problem.
#
# Now, let's use this in a pattern
nlp.matcher.add("AuCitySportsTeam", "ORG", {},
[
[
{LOWER: "the"},
{is_australian_capital: True},
{TAG: "NNS"}
],
[
{LOWER: "the"},
{is_australian_capital: True},
{TAG: "NNPS"}
],
[
{LOWER: "the"},
{IS_ALPHA: True}, # Allow a word in between, e.g. The Western Sydney
{is_australian_capital: True},
{TAG: "NNS"}
],
[
{LOWER: "the"},
{IS_ALPHA: True}, # Allow a word in between, e.g. The Western Sydney
{is_australian_capital: True},
{TAG: "NNPS"}
]
])
doc = nlp(u'The pattern should match the Brisbane Broncos and the South Darwin Spiders, but not the Colorado Boulders')
for ent in doc.ents:
print(ent.text, ent.label_)
# Output
# the Brisbane Broncos ORG
# the South Darwin Spiders ORG
# Output
# Before
# Google ORG [u'NNP']
# google ORG [u'VB']
# google ORG [u'NNP']
# After
# Google Now PRODUCT [u'NNP', u'RB']
# google ORG [u'VB']
# google now PRODUCT [u'NNP', u'RB']
# Sydney True
# sydney False
# Sydney True
# sydney True
# SYDNEY True
# the Brisbane Broncos ORG
# the South Darwin Spiders ORG
if __name__ == '__main__':
main()

View File

@ -1,74 +0,0 @@
from __future__ import print_function, unicode_literals, division
import io
import bz2
import logging
from toolz import partition
from os import path
import re
import spacy.en
from spacy.tokens import Doc
from joblib import Parallel, delayed
import plac
import ujson
def parallelize(func, iterator, n_jobs, extra, backend='multiprocessing'):
extra = tuple(extra)
return Parallel(n_jobs=n_jobs, backend=backend)(delayed(func)(*(item + extra))
for item in iterator)
def iter_comments(loc):
with bz2.BZ2File(loc) as file_:
for i, line in enumerate(file_):
yield ujson.loads(line)['body']
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
def strip_meta(text):
text = link_re.sub(r'\1', text)
text = text.replace('&gt;', '>').replace('&lt;', '<')
text = pre_format_re.sub('', text)
text = post_format_re.sub('', text)
return text.strip()
def save_parses(batch_id, input_, out_dir, n_threads, batch_size):
out_loc = path.join(out_dir, '%d.bin' % batch_id)
if path.exists(out_loc):
return None
print('Batch', batch_id)
nlp = spacy.en.English()
nlp.matcher = None
with open(out_loc, 'wb') as file_:
texts = (strip_meta(text) for text in input_)
texts = (text for text in texts if text.strip())
for doc in nlp.pipe(texts, batch_size=batch_size, n_threads=n_threads):
file_.write(doc.to_bytes())
@plac.annotations(
in_loc=("Location of input file"),
out_dir=("Location of input file"),
n_process=("Number of processes", "option", "p", int),
n_thread=("Number of threads per process", "option", "t", int),
batch_size=("Number of texts to accumulate in a buffer", "option", "b", int)
)
def main(in_loc, out_dir, n_process=1, n_thread=4, batch_size=100):
if not path.exists(out_dir):
path.join(out_dir)
if n_process >= 2:
texts = partition(200000, iter_comments(in_loc))
parallelize(save_parses, enumerate(texts), n_process, [out_dir, n_thread, batch_size],
backend='multiprocessing')
else:
save_parses(0, iter_comments(in_loc), out_dir, n_thread, batch_size)
if __name__ == '__main__':
plac.call(main)

View File

@ -1,35 +1,60 @@
#!/usr/bin/env python
# coding: utf-8
"""This example contains several snippets of methods that can be set via custom
Doc, Token or Span attributes in spaCy v2.0. Attribute methods act like
they're "bound" to the object and are partially applied i.e. the object
they're called on is passed in as the first argument."""
from __future__ import unicode_literals
they're called on is passed in as the first argument.
* Custom pipeline components: https://alpha.spacy.io//usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
from spacy.lang.en import English
from spacy.tokens import Doc, Span
from spacy import displacy
from pathlib import Path
@plac.annotations(
output_dir=("Output directory for saved HTML", "positional", None, Path))
def main(output_dir=None):
nlp = English() # start off with blank English class
Doc.set_extension('overlap', method=overlap_tokens)
doc1 = nlp(u"Peach emoji is where it has always been.")
doc2 = nlp(u"Peach is the superior emoji.")
print("Text 1:", doc1.text)
print("Text 2:", doc2.text)
print("Overlapping tokens:", doc1._.overlap(doc2))
Doc.set_extension('to_html', method=to_html)
doc = nlp(u"This is a sentence about Apple.")
# add entity manually for demo purposes, to make it work without a model
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings['ORG'])]
print("Text:", doc.text)
doc._.to_html(output=output_dir, style='ent')
def to_html(doc, output='/tmp', style='dep'):
"""Doc method extension for saving the current state as a displaCy
visualization.
"""
# generate filename from first six non-punct tokens
file_name = '-'.join([w.text for w in doc[:6] if not w.is_punct]) + '.html'
output_path = Path(output) / file_name
html = displacy.render(doc, style=style, page=True) # render markup
output_path.open('w', encoding='utf-8').write(html) # save to file
print('Saved HTML to {}'.format(output_path))
Doc.set_extension('to_html', method=to_html)
nlp = English()
doc = nlp(u"This is a sentence about Apple.")
# add entity manually for demo purposes, to make it work without a model
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings['ORG'])]
doc._.to_html(style='ent')
if output is not None:
output_path = Path(output)
if not output_path.exists():
output_path.mkdir()
output_file = Path(output) / file_name
output_file.open('w', encoding='utf-8').write(html) # save to file
print('Saved HTML to {}'.format(output_file))
else:
print(html)
def overlap_tokens(doc, other_doc):
@ -43,10 +68,10 @@ def overlap_tokens(doc, other_doc):
return overlap
Doc.set_extension('overlap', method=overlap_tokens)
if __name__ == '__main__':
plac.call(main)
nlp = English()
doc1 = nlp(u"Peach emoji is where it has always been.")
doc2 = nlp(u"Peach is the superior emoji.")
tokens = doc1._.overlap(doc2)
print(tokens)
# Expected output:
# Text 1: Peach emoji is where it has always been.
# Text 2: Peach is the superior emoji.
# Overlapping tokens: [Peach, emoji, is, .]

View File

@ -1,21 +1,45 @@
# coding: utf-8
from __future__ import unicode_literals
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that requests all countries via
the REST Countries API, merges country names into one token, assigns entity
labels and sets attributes on country tokens, e.g. the capital and lat/lng
coordinates. Can be extended with more details from the API.
* REST Countries API: https://restcountries.eu (Mozilla Public License MPL 2.0)
* Custom pipeline components: https://alpha.spacy.io//usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import requests
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
class RESTCountriesComponent(object):
"""Example of a spaCy v2.0 pipeline component that requests all countries
via the REST Countries API, merges country names into one token, assigns
entity labels and sets attributes on country tokens, e.g. the capital and
lat/lng coordinates. Can be extended with more details from the API.
def main():
# For simplicity, we start off with only the blank English Language class
# and no model or pre-defined pipeline loaded.
nlp = English()
rest_countries = RESTCountriesComponent(nlp) # initialise component
nlp.add_pipe(rest_countries) # add it to the pipeline
doc = nlp(u"Some text about Colombia and the Czech Republic")
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Doc has countries', doc._.has_country) # Doc contains countries
for token in doc:
if token._.is_country:
print(token.text, token._.country_capital, token._.country_latlng,
token._.country_flag) # country data
print('Entities', [(e.text, e.label_) for e in doc.ents]) # entities
REST Countries API: https://restcountries.eu
API License: Mozilla Public License MPL 2.0
class RESTCountriesComponent(object):
"""spaCy v2.0 pipeline component that requests all countries via
the REST Countries API, merges country names into one token, assigns entity
labels and sets attributes on country tokens.
"""
name = 'rest_countries' # component name, will show up in the pipeline
@ -90,19 +114,12 @@ class RESTCountriesComponent(object):
return any([t._.get('is_country') for t in tokens])
# For simplicity, we start off with only the blank English Language class and
# no model or pre-defined pipeline loaded.
if __name__ == '__main__':
plac.call(main)
nlp = English()
rest_countries = RESTCountriesComponent(nlp) # initialise component
nlp.add_pipe(rest_countries) # add it to the pipeline
doc = nlp(u"Some text about Colombia and the Czech Republic")
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Doc has countries', doc._.has_country) # Doc contains countries
for token in doc:
if token._.is_country:
print(token.text, token._.country_capital, token._.country_latlng,
token._.country_flag) # country data
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all countries are entities
# Expected output:
# Pipeline ['rest_countries']
# Doc has countries True
# Colombia Bogotá [4.0, -72.0] https://restcountries.eu/data/col.svg
# Czech Republic Prague [49.75, 15.5] https://restcountries.eu/data/cze.svg
# Entities [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]

View File

@ -1,11 +1,45 @@
# coding: utf-8
from __future__ import unicode_literals
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
based on list of single or multiple-word company names. Companies are
labelled as ORG and their spans are merged into one token. Additionally,
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
respectively.
* Custom pipeline components: https://alpha.spacy.io//usage/processing-pipelines#custom-components
Developed for: spaCy 2.0.0a17
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
@plac.annotations(
text=("Text to process", "positional", None, str),
companies=("Names of technology companies", "positional", None, str))
def main(text="Alphabet Inc. is the company behind Google.", *companies):
# For simplicity, we start off with only the blank English Language class
# and no model or pre-defined pipeline loaded.
nlp = English()
if not companies: # set default companies if none are set via args
companies = ['Alphabet Inc.', 'Google', 'Netflix', 'Apple'] # etc.
component = TechCompanyRecognizer(nlp, companies) # initialise component
nlp.add_pipe(component, last=True) # add last to the pipeline
doc = nlp(text)
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Tokens', [t.text for t in doc]) # company names from the list are merged
print('Doc has_tech_org', doc._.has_tech_org) # Doc contains tech orgs
print('Token 0 is_tech_org', doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
print('Token 1 is_tech_org', doc[1]._.is_tech_org) # "is" is not
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
class TechCompanyRecognizer(object):
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
based on list of single or multiple-word company names. Companies are
@ -67,19 +101,13 @@ class TechCompanyRecognizer(object):
return any([t._.get('is_tech_org') for t in tokens])
# For simplicity, we start off with only the blank English Language class and
# no model or pre-defined pipeline loaded.
if __name__ == '__main__':
plac.call(main)
nlp = English()
companies = ['Alphabet Inc.', 'Google', 'Netflix', 'Apple'] # etc.
component = TechCompanyRecognizer(nlp, companies) # initialise component
nlp.add_pipe(component, last=True) # add it to the pipeline as the last element
doc = nlp(u"Alphabet Inc. is the company behind Google.")
print('Pipeline', nlp.pipe_names) # pipeline contains component name
print('Tokens', [t.text for t in doc]) # company names from the list are merged
print('Doc has_tech_org', doc._.has_tech_org) # Doc contains tech orgs
print('Token 0 is_tech_org', doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
print('Token 1 is_tech_org', doc[1]._.is_tech_org) # "is" is not
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
# Expected output:
# Pipeline ['tech_companies']
# Tokens ['Alphabet Inc.', 'is', 'the', 'company', 'behind', 'Google', '.']
# Doc has_tech_org True
# Token 0 is_tech_org True
# Token 1 is_tech_org False
# Entities [('Alphabet Inc.', 'ORG'), ('Google', 'ORG')]

View File

@ -0,0 +1,73 @@
"""
Example of multi-processing with Joblib. Here, we're exporting
part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
each "sentence" on a newline, and spaces between tokens. Data is loaded from
the IMDB movie reviews dataset and will be loaded automatically via Thinc's
built-in dataset loader.
Last updated for: spaCy 2.0.0a18
"""
from __future__ import print_function, unicode_literals
from toolz import partition_all
from pathlib import Path
from joblib import Parallel, delayed
import thinc.extra.datasets
import plac
import spacy
@plac.annotations(
output_dir=("Output directory", "positional", None, Path),
model=("Model name (needs tagger)", "positional", None, str),
n_jobs=("Number of workers", "option", "n", int),
batch_size=("Batch-size for each process", "option", "b", int),
limit=("Limit of entries from the dataset", "option", "l", int))
def main(output_dir, model='en_core_web_sm', n_jobs=4, batch_size=1000,
limit=10000):
nlp = spacy.load(model) # load spaCy model
print("Loaded model '%s'" % model)
if not output_dir.exists():
output_dir.mkdir()
# load and pre-process the IMBD dataset
print("Loading IMDB data...")
data, _ = thinc.extra.datasets.imdb()
texts, _ = zip(*data[-limit:])
partitions = partition_all(batch_size, texts)
items = ((i, [nlp(text) for text in texts], output_dir) for i, texts
in enumerate(partitions))
Parallel(n_jobs=n_jobs)(delayed(transform_texts)(*item) for item in items)
def transform_texts(batch_id, docs, output_dir):
out_path = Path(output_dir) / ('%d.txt' % batch_id)
if out_path.exists(): # return None in case same batch is called again
return None
print('Processing batch', batch_id)
with out_path.open('w', encoding='utf8') as f:
for doc in docs:
f.write(' '.join(represent_word(w) for w in doc if not w.is_space))
f.write('\n')
print('Saved {} texts to {}.txt'.format(len(docs), batch_id))
def represent_word(word):
text = word.text
# True-case, i.e. try to normalize sentence-initial capitals.
# Only do this if the lower-cased form is more probable.
if text.istitle() and is_sent_begin(word) \
and word.prob < word.doc.vocab[text.lower()].prob:
text = text.lower()
return text + '|' + word.tag_
def is_sent_begin(word):
if word.i == 0:
return True
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
return True
else:
return False
if __name__ == '__main__':
plac.call(main)

View File

@ -1,90 +0,0 @@
"""
Print part-of-speech tagged, true-cased, (very roughly) sentence-separated
text, with each "sentence" on a newline, and spaces between tokens. Supports
multi-processing.
"""
from __future__ import print_function, unicode_literals, division
import io
import bz2
import logging
from toolz import partition
from os import path
import spacy.en
from joblib import Parallel, delayed
import plac
import ujson
def parallelize(func, iterator, n_jobs, extra):
extra = tuple(extra)
return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
def iter_texts_from_json_bz2(loc):
"""
Iterator of unicode strings, one per document (here, a comment).
Expects a a path to a BZ2 file, which should be new-line delimited JSON. The
document text should be in a string field titled 'body'.
This is the data format of the Reddit comments corpus.
"""
with bz2.BZ2File(loc) as file_:
for i, line in enumerate(file_):
yield ujson.loads(line)['body']
def transform_texts(batch_id, input_, out_dir):
out_loc = path.join(out_dir, '%d.txt' % batch_id)
if path.exists(out_loc):
return None
print('Batch', batch_id)
nlp = spacy.en.English(parser=False, entity=False)
with io.open(out_loc, 'w', encoding='utf8') as file_:
for text in input_:
doc = nlp(text)
file_.write(' '.join(represent_word(w) for w in doc if not w.is_space))
file_.write('\n')
def represent_word(word):
text = word.text
# True-case, i.e. try to normalize sentence-initial capitals.
# Only do this if the lower-cased form is more probable.
if text.istitle() \
and is_sent_begin(word) \
and word.prob < word.doc.vocab[text.lower()].prob:
text = text.lower()
return text + '|' + word.tag_
def is_sent_begin(word):
# It'd be nice to have some heuristics like these in the library, for these
# times where we don't care so much about accuracy of SBD, and we don't want
# to parse
if word.i == 0:
return True
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
return True
else:
return False
@plac.annotations(
in_loc=("Location of input file"),
out_dir=("Location of input file"),
n_workers=("Number of workers", "option", "n", int),
batch_size=("Batch-size for each process", "option", "b", int)
)
def main(in_loc, out_dir, n_workers=4, batch_size=100000):
if not path.exists(out_dir):
path.join(out_dir)
texts = partition(batch_size, iter_texts_from_json_bz2(in_loc))
parallelize(transform_texts, enumerate(texts), n_workers, [out_dir])
if __name__ == '__main__':
plac.call(main)

View File

@ -1,22 +0,0 @@
# Load NER
from __future__ import unicode_literals
import spacy
import pathlib
from spacy.pipeline import EntityRecognizer
from spacy.vocab import Vocab
def load_model(model_dir):
model_dir = pathlib.Path(model_dir)
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
with (model_dir / 'vocab' / 'strings.json').open('r', encoding='utf8') as file_:
nlp.vocab.strings.load(file_)
nlp.vocab.load_lexemes(model_dir / 'vocab' / 'lexemes.bin')
ner = EntityRecognizer.load(model_dir, nlp.vocab, require=True)
return (nlp, ner)
(nlp, ner) = load_model('ner')
doc = nlp.make_doc('Who is Shaka Khan?')
nlp.tagger(doc)
ner(doc)
for word in doc:
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)

View File

@ -0,0 +1,157 @@
#!/usr/bin/env python
# coding: utf-8
"""Using the parser to recognise your own semantics
spaCy's parser component can be used to trained to predict any type of tree
structure over your input text. You can also predict trees over whole documents
or chat logs, with connections between the sentence-roots used to annotate
discourse structure. In this example, we'll build a message parser for a common
"chat intent": finding local businesses. Our message semantics will have the
following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION.
"show me the best hotel in berlin"
('show', 'ROOT', 'show')
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
('hotel', 'PLACE', 'show') --> show PLACE hotel
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin
"""
from __future__ import unicode_literals, print_function
import plac
import random
import spacy
from spacy.gold import GoldParse
from spacy.tokens import Doc
from pathlib import Path
# training data: words, head and dependency labels
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
TRAIN_DATA = [
(
['find', 'a', 'cafe', 'with', 'great', 'wifi'],
[0, 2, 0, 5, 5, 2], # index of token head
['ROOT', '-', 'PLACE', '-', 'QUALITY', 'ATTRIBUTE']
),
(
['find', 'a', 'hotel', 'near', 'the', 'beach'],
[0, 2, 0, 5, 5, 2],
['ROOT', '-', 'PLACE', 'QUALITY', '-', 'ATTRIBUTE']
),
(
['find', 'me', 'the', 'closest', 'gym', 'that', "'s", 'open', 'late'],
[0, 0, 4, 4, 0, 6, 4, 6, 6],
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'ATTRIBUTE', 'TIME']
),
(
['show', 'me', 'the', 'cheapest', 'store', 'that', 'sells', 'flowers'],
[0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'PRODUCT']
),
(
['find', 'a', 'nice', 'restaurant', 'in', 'london'],
[0, 3, 3, 0, 3, 3],
['ROOT', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
),
(
['show', 'me', 'the', 'coolest', 'hostel', 'in', 'berlin'],
[0, 0, 4, 4, 0, 4, 4],
['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
),
(
['find', 'a', 'good', 'italian', 'restaurant', 'near', 'work'],
[0, 4, 4, 4, 0, 4, 5],
['ROOT', '-', 'QUALITY', 'ATTRIBUTE', 'PLACE', 'ATTRIBUTE', 'LOCATION']
)
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=100):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# add the parser to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'parser' not in nlp.pipe_names:
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, first=True)
# otherwise, get it, so we can add labels to it
else:
parser = nlp.get_pipe('parser')
for _, _, deps in TRAIN_DATA:
for dep in deps:
parser.add_label(dep)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, heads, deps in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_model(nlp)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
test_model(nlp2)
def test_model(nlp):
texts = ["find a hotel with good wifi",
"find me the cheapest gym near work",
"show me the best hotel in berlin"]
docs = nlp.pipe(texts)
for doc in docs:
print(doc.text)
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
if __name__ == '__main__':
plac.call(main)
# Expected output:
# find a hotel with good wifi
# [
# ('find', 'ROOT', 'find'),
# ('hotel', 'PLACE', 'find'),
# ('good', 'QUALITY', 'wifi'),
# ('wifi', 'ATTRIBUTE', 'hotel')
# ]
# find me the cheapest gym near work
# [
# ('find', 'ROOT', 'find'),
# ('cheapest', 'QUALITY', 'gym'),
# ('gym', 'PLACE', 'find')
# ]
# show me the best hotel in berlin
# [
# ('show', 'ROOT', 'show'),
# ('best', 'QUALITY', 'hotel'),
# ('hotel', 'PLACE', 'show'),
# ('berlin', 'LOCATION', 'hotel')
# ]

View File

@ -1,13 +1,103 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training spaCy's named entity recognizer, starting off with an
existing model or a blank model.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* NER: https://alpha.spacy.io/usage/linguistic-features#named-entities
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
from spacy.lang.en import English
import spacy
from spacy.gold import GoldParse, biluo_tags_from_offsets
# training data
TRAIN_DATA = [
('Who is Shaka Khan?', [(7, 17, 'PERSON')]),
('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=100):
"""Load the model, set up the pipeline and train the entity recognizer."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# create the built-in pipeline components and add them to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner, last=True)
# function that allows begin_training to get the training data
get_data = lambda: reformat_train_data(nlp.tokenizer, TRAIN_DATA)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
optimizer = nlp.begin_training(get_data)
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for raw_text, entity_offsets in TRAIN_DATA:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.update(
[doc], # Batch of Doc objects
[gold], # Batch of GoldParse objects
drop=0.5, # Dropout -- make it harder to memorise data
sgd=optimizer, # Callable to update weights
losses=losses)
print(losses)
# test the trained model
for text, _ in TRAIN_DATA:
doc = nlp(text)
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text, _ in TRAIN_DATA:
doc = nlp2(text)
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
def reformat_train_data(tokenizer, examples):
"""Reformat data to match JSON format"""
"""Reformat data to match JSON format.
https://alpha.spacy.io/api/annotation#json-input
tokenizer (Tokenizer): Tokenizer to process the raw text.
examples (list): The trainig data.
RETURNS (list): The reformatted training data."""
output = []
for i, (text, entity_offsets) in enumerate(examples):
doc = tokenizer(text)
@ -21,49 +111,5 @@ def reformat_train_data(tokenizer, examples):
return output
def main(model_dir=None):
train_data = [
(
'Who is Shaka Khan?',
[(len('Who is '), len('Who is Shaka Khan'), 'PERSON')]
),
(
'I like London and Berlin.',
[(len('I like '), len('I like London'), 'LOC'),
(len('I like London and '), len('I like London and Berlin'), 'LOC')]
)
]
nlp = English(pipeline=['tensorizer', 'ner'])
get_data = lambda: reformat_train_data(nlp.tokenizer, train_data)
optimizer = nlp.begin_training(get_data)
for itn in range(100):
random.shuffle(train_data)
losses = {}
for raw_text, entity_offsets in train_data:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.update(
[doc], # Batch of Doc objects
[gold], # Batch of GoldParse objects
drop=0.5, # Dropout -- make it harder to memorise data
sgd=optimizer, # Callable to update weights
losses=losses)
print(losses)
print("Save to", model_dir)
nlp.to_disk(model_dir)
print("Load from", model_dir)
nlp = spacy.lang.en.English(pipeline=['tensorizer', 'ner'])
nlp.from_disk(model_dir)
for raw_text, _ in train_data:
doc = nlp(raw_text)
for word in doc:
print(word.text, word.ent_type_, word.ent_iob_)
if __name__ == '__main__':
import plac
plac.call(main)
# Who "" 2
# is "" 2
# Shaka "" PERSON 3
# Khan "" PERSON 1
# ? "" 2

View File

@ -1,206 +0,0 @@
#!/usr/bin/env python
'''Example of training a named entity recognition system from scratch using spaCy
This example is written to be self-contained and reasonably transparent.
To achieve that, it duplicates some of spaCy's internal functionality.
Specifically, in this example, we don't use spaCy's built-in Language class to
wire together the Vocab, Tokenizer and EntityRecognizer. Instead, we write
our own simple Pipeline class, so that it's easier to see how the pieces
interact.
Input data:
https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/data/GermEval2014_complete_data.zip
Developed for: spaCy 1.7.1
Last tested for: spaCy 2.0.0a13
'''
from __future__ import unicode_literals, print_function
import plac
from pathlib import Path
import random
import json
import tqdm
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
from spacy.vocab import Vocab
from spacy.pipeline import TokenVectorEncoder, NeuralEntityRecognizer
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc
from spacy.attrs import *
from spacy.gold import GoldParse
from spacy.gold import iob_to_biluo
from spacy.gold import minibatch
from spacy.scorer import Scorer
import spacy.util
try:
unicode
except NameError:
unicode = str
spacy.util.set_env_log(True)
def init_vocab():
return Vocab(
lex_attr_getters={
LOWER: lambda string: string.lower(),
NORM: lambda string: string.lower(),
PREFIX: lambda string: string[0],
SUFFIX: lambda string: string[-3:],
})
class Pipeline(object):
def __init__(self, vocab=None, tokenizer=None, entity=None):
if vocab is None:
vocab = init_vocab()
if tokenizer is None:
tokenizer = Tokenizer(vocab, {}, None, None, None)
if entity is None:
entity = NeuralEntityRecognizer(vocab)
self.vocab = vocab
self.tokenizer = tokenizer
self.entity = entity
self.pipeline = [self.entity]
def begin_training(self):
for model in self.pipeline:
model.begin_training([])
optimizer = Adam(NumpyOps(), 0.001)
return optimizer
def __call__(self, input_):
doc = self.make_doc(input_)
for process in self.pipeline:
process(doc)
return doc
def make_doc(self, input_):
if isinstance(input_, bytes):
input_ = input_.decode('utf8')
if isinstance(input_, unicode):
return self.tokenizer(input_)
else:
return Doc(self.vocab, words=input_)
def make_gold(self, input_, annotations):
doc = self.make_doc(input_)
gold = GoldParse(doc, entities=annotations)
return gold
def update(self, inputs, annots, sgd, losses=None, drop=0.):
if losses is None:
losses = {}
docs = [self.make_doc(input_) for input_ in inputs]
golds = [self.make_gold(input_, annot) for input_, annot in
zip(inputs, annots)]
self.entity.update(docs, golds, drop=drop,
sgd=sgd, losses=losses)
return losses
def evaluate(self, examples):
scorer = Scorer()
for input_, annot in examples:
gold = self.make_gold(input_, annot)
doc = self(input_)
scorer.score(doc, gold)
return scorer.scores
def to_disk(self, path):
path = Path(path)
if not path.exists():
path.mkdir()
elif not path.is_dir():
raise IOError("Can't save pipeline to %s\nNot a directory" % path)
self.vocab.to_disk(path / 'vocab')
self.entity.to_disk(path / 'ner')
def from_disk(self, path):
path = Path(path)
if not path.exists():
raise IOError("Cannot load pipeline from %s\nDoes not exist" % path)
if not path.is_dir():
raise IOError("Cannot load pipeline from %s\nNot a directory" % path)
self.vocab = self.vocab.from_disk(path / 'vocab')
self.entity = self.entity.from_disk(path / 'ner')
def train(nlp, train_examples, dev_examples, nr_epoch=5):
sgd = nlp.begin_training()
print("Iter", "Loss", "P", "R", "F")
for i in range(nr_epoch):
random.shuffle(train_examples)
losses = {}
for batch in minibatch(tqdm.tqdm(train_examples, leave=False), size=8):
inputs, annots = zip(*batch)
nlp.update(list(inputs), list(annots), sgd, losses=losses)
scores = nlp.evaluate(dev_examples)
report_scores(i+1, losses['ner'], scores)
def report_scores(i, loss, scores):
precision = '%.2f' % scores['ents_p']
recall = '%.2f' % scores['ents_r']
f_measure = '%.2f' % scores['ents_f']
print('Epoch %d: %d %s %s %s' % (
i, int(loss), precision, recall, f_measure))
def read_examples(path):
path = Path(path)
with path.open() as file_:
sents = file_.read().strip().split('\n\n')
for sent in sents:
sent = sent.strip()
if not sent:
continue
tokens = sent.split('\n')
while tokens and tokens[0].startswith('#'):
tokens.pop(0)
words = []
iob = []
for token in tokens:
if token.strip():
pieces = token.split('\t')
words.append(pieces[1])
iob.append(pieces[2])
yield words, iob_to_biluo(iob)
def get_labels(examples):
labels = set()
for words, tags in examples:
for tag in tags:
if '-' in tag:
labels.add(tag.split('-')[1])
return sorted(labels)
@plac.annotations(
model_dir=("Path to save the model", "positional", None, Path),
train_loc=("Path to your training data", "positional", None, Path),
dev_loc=("Path to your development data", "positional", None, Path),
)
def main(model_dir, train_loc, dev_loc, nr_epoch=30):
print(model_dir, train_loc, dev_loc)
train_examples = list(read_examples(train_loc))
dev_examples = read_examples(dev_loc)
nlp = Pipeline()
for label in get_labels(train_examples):
nlp.entity.add_label(label)
print("Add label", label)
train(nlp, train_examples, list(dev_examples), nr_epoch)
nlp.to_disk(model_dir)
if __name__ == '__main__':
plac.call(main)

View File

@ -21,104 +21,120 @@ After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.
For more details, see the documentation:
* Training the Named Entity Recognizer: https://spacy.io/docs/usage/train-ner
* Saving and loading models: https://spacy.io/docs/usage/saving-loading
* Training: https://alpha.spacy.io/usage/training
* NER: https://alpha.spacy.io/usage/linguistic-features#named-entities
Developed for: spaCy 1.7.6
Last updated for: spaCy 2.0.0a13
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import random
import spacy
from spacy.gold import GoldParse, minibatch
from spacy.pipeline import NeuralEntityRecognizer
from spacy.pipeline import TokenVectorEncoder
# new entity label
LABEL = 'ANIMAL'
# training data
TRAIN_DATA = [
("Horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]),
("Do they bite?", []),
("horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]),
("horses pretend to care about your feelings", [(0, 6, 'ANIMAL')]),
("they pretend to care about your feelings, those horses",
[(48, 54, 'ANIMAL')]),
("horses?", [(0, 6, 'ANIMAL')])
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
new_model_name=("New model name for model meta.", "option", "nm", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, new_model_name='animal', output_dir=None, n_iter=50):
"""Set up the pipeline and entity recognizer, and train the new entity."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# otherwise, get it, so we can add labels to it
else:
ner = nlp.get_pipe('ner')
ner.add_label(LABEL) # add new entity label to entity recognizer
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
random.seed(0)
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
losses = {}
gold_parses = get_gold_parses(nlp.make_doc, TRAIN_DATA)
for batch in minibatch(gold_parses, size=3):
docs, golds = zip(*batch)
nlp.update(docs, golds, losses=losses, sgd=optimizer,
drop=0.35)
print(losses)
# test the trained model
test_text = 'Do you like horses?'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
print(ent.label_, ent.text)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.meta['name'] = new_model_name # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
for ent in doc2.ents:
print(ent.label_, ent.text)
def get_gold_parses(tokenizer, train_data):
'''Shuffle and create GoldParse objects'''
"""Shuffle and create GoldParse objects.
tokenizer (Tokenizer): Tokenizer to processs the raw text.
train_data (list): The training data.
YIELDS (tuple): (doc, gold) tuples.
"""
random.shuffle(train_data)
for raw_text, entity_offsets in train_data:
doc = tokenizer(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
yield doc, gold
def train_ner(nlp, train_data, output_dir):
random.seed(0)
optimizer = nlp.begin_training(lambda: [])
nlp.meta['name'] = 'en_ent_animal'
for itn in range(50):
losses = {}
for batch in minibatch(get_gold_parses(nlp.make_doc, train_data), size=3):
docs, golds = zip(*batch)
nlp.update(docs, golds, losses=losses, sgd=optimizer, drop=0.35)
print(losses)
if not output_dir:
return
elif not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
def main(model_name, output_directory=None):
print("Creating initial model", model_name)
nlp = spacy.blank(model_name)
if output_directory is not None:
output_directory = Path(output_directory)
train_data = [
(
"Horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')],
),
(
"Do they bite?",
[],
),
(
"horses are too tall and they pretend to care about your feelings",
[(0, 6, 'ANIMAL')]
),
(
"horses pretend to care about your feelings",
[(0, 6, 'ANIMAL')]
),
(
"they pretend to care about your feelings, those horses",
[(48, 54, 'ANIMAL')]
),
(
"horses?",
[(0, 6, 'ANIMAL')]
)
]
nlp.add_pipe(TokenVectorEncoder(nlp.vocab))
ner = NeuralEntityRecognizer(nlp.vocab)
ner.add_label('ANIMAL')
nlp.add_pipe(ner)
train_ner(nlp, train_data, output_directory)
# Test that the entity is recognized
text = 'Do you like horses?'
print("Ents in 'Do you like horses?':")
doc = nlp(text)
for ent in doc.ents:
print(ent.label_, ent.text)
if output_directory:
print("Loading from", output_directory)
nlp2 = spacy.load(output_directory)
doc2 = nlp2('Do you like horses?')
for ent in doc2.ents:
print(ent.label_, ent.text)
if __name__ == '__main__':
import plac
plac.call(main)

View File

@ -1,75 +1,109 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training spaCy dependency parser, starting off with an existing model
or a blank model.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* Dependency Parse: https://alpha.spacy.io/usage/linguistic-features#dependency-parse
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import json
import pathlib
import plac
import random
from pathlib import Path
import spacy
from spacy.pipeline import DependencyParser
from spacy.gold import GoldParse
from spacy.tokens import Doc
def train_parser(nlp, train_data, left_labels, right_labels):
parser = DependencyParser(
nlp.vocab,
left_labels=left_labels,
right_labels=right_labels)
for itn in range(1000):
random.shuffle(train_data)
loss = 0
for words, heads, deps in train_data:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
loss += parser.update(doc, gold)
parser.model.end_training()
return parser
# training data
TRAIN_DATA = [
(
['They', 'trade', 'mortgage', '-', 'backed', 'securities', '.'],
[1, 1, 4, 4, 5, 1, 1],
['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
),
(
['I', 'like', 'London', 'and', 'Berlin', '.'],
[1, 1, 1, 2, 2, 1],
['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
)
]
def main(model_dir=None):
if model_dir is not None:
model_dir = pathlib.Path(model_dir)
if not model_dir.exists():
model_dir.mkdir()
assert model_dir.is_dir()
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=1000):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
nlp = spacy.load('en', tagger=False, parser=False, entity=False, add_vectors=False)
# add the parser to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'parser' not in nlp.pipe_names:
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, first=True)
# otherwise, get it, so we can add labels to it
else:
parser = nlp.get_pipe('parser')
train_data = [
(
['They', 'trade', 'mortgage', '-', 'backed', 'securities', '.'],
[1, 1, 4, 4, 5, 1, 1],
['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
),
(
['I', 'like', 'London', 'and', 'Berlin', '.'],
[1, 1, 1, 2, 2, 1],
['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
)
]
left_labels = set()
right_labels = set()
for _, heads, deps in train_data:
for i, (head, dep) in enumerate(zip(heads, deps)):
if i < head:
left_labels.add(dep)
elif i > head:
right_labels.add(dep)
parser = train_parser(nlp, train_data, sorted(left_labels), sorted(right_labels))
# add labels to the parser
for _, _, deps in TRAIN_DATA:
for dep in deps:
parser.add_label(dep)
doc = Doc(nlp.vocab, words=['I', 'like', 'securities', '.'])
parser(doc)
for word in doc:
print(word.text, word.dep_, word.head.text)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training(lambda: [])
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, heads, deps in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, heads=heads, deps=deps)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
if model_dir is not None:
with (model_dir / 'config.json').open('w') as file_:
json.dump(parser.cfg, file_)
parser.model.dump(str(model_dir / 'model'))
# test the trained model
test_text = "I like securities."
doc = nlp(test_text)
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
if __name__ == '__main__':
main()
# I nsubj like
# like ROOT like
# securities dobj like
# . cc securities
plac.call(main)
# expected result:
# [
# ('I', 'nsubj', 'like'),
# ('like', 'ROOT', 'like'),
# ('securities', 'dobj', 'like'),
# ('.', 'punct', 'like')
# ]

View File

@ -1,18 +1,28 @@
"""A quick example for training a part-of-speech tagger, without worrying
about the tokenization, or other language-specific customizations."""
#!/usr/bin/env python
# coding: utf8
"""
A simple example for training a part-of-speech tagger with a custom tag map.
To allow us to update the tag map with our custom one, this example starts off
with a blank Language class and modifies its defaults.
from __future__ import unicode_literals
from __future__ import print_function
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* POS Tagging: https://alpha.spacy.io/usage/linguistic-features#pos-tagging
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
from spacy.vocab import Vocab
from spacy.tagger import Tagger
import spacy
from spacy.util import get_lang_class
from spacy.tokens import Doc
from spacy.gold import GoldParse
import random
# You need to define a mapping from your data's part-of-speech tag names to the
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
@ -28,54 +38,67 @@ TAG_MAP = {
# Usually you'll read this in, of course. Data formats vary.
# Ensure your strings are unicode.
DATA = [
(
["I", "like", "green", "eggs"],
["N", "V", "J", "N"]
),
(
["Eat", "blue", "ham"],
["V", "J", "N"]
)
TRAIN_DATA = [
(["I", "like", "green", "eggs"], ["N", "V", "J", "N"]),
(["Eat", "blue", "ham"], ["V", "J", "N"])
]
def ensure_dir(path):
if not path.exists():
path.mkdir()
@plac.annotations(
lang=("ISO Code of language to use", "option", "l", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(lang='en', output_dir=None, n_iter=25):
"""Create a new model, set up the pipeline and train the tagger. In order to
train the tagger with a custom tag map, we're creating a new Language
instance with a custom vocab.
"""
lang_cls = get_lang_class(lang) # get Language class
lang_cls.Defaults.tag_map.update(TAG_MAP) # add tag map to defaults
nlp = lang_cls() # initialise Language class
# add the tagger to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
tagger = nlp.create_pipe('tagger')
nlp.add_pipe(tagger)
def main(output_dir=None):
optimizer = nlp.begin_training(lambda: [])
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for words, tags in TRAIN_DATA:
doc = Doc(nlp.vocab, words=words)
gold = GoldParse(doc, tags=tags)
nlp.update([doc], [gold], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_text = "I like blue eggs"
doc = nlp(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
ensure_dir(output_dir)
ensure_dir(output_dir / "pos")
ensure_dir(output_dir / "vocab")
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
vocab = Vocab(tag_map=TAG_MAP)
# The default_templates argument is where features are specified. See
# spacy/tagger.pyx for the defaults.
tagger = Tagger(vocab)
for i in range(25):
for words, tags in DATA:
doc = Doc(vocab, words=words)
gold = GoldParse(doc, tags=tags)
tagger.update(doc, gold)
random.shuffle(DATA)
tagger.model.end_training()
doc = Doc(vocab, orths_and_spaces=zip(["I", "like", "blue", "eggs"], [True] * 4))
tagger(doc)
for word in doc:
print(word.text, word.tag_, word.pos_)
if output_dir is not None:
tagger.model.dump(str(output_dir / 'pos' / 'model'))
with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
tagger.vocab.strings.dump(file_)
# test the save model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
if __name__ == '__main__':
plac.call(main)
# I V VERB
# like V VERB
# blue N NOUN
# eggs N NOUN
# Expected output:
# [
# ('I', 'N', 'NOUN'),
# ('like', 'V', 'VERB'),
# ('blue', 'J', 'ADJ'),
# ('eggs', 'N', 'NOUN')
# ]

View File

@ -1,58 +1,119 @@
'''Train a multi-label convolutional neural network text classifier,
using the spacy.pipeline.TextCategorizer component. The model is then added
to spacy.pipeline, and predictions are available at `doc.cats`.
'''
from __future__ import unicode_literals
#!/usr/bin/env python
# coding: utf8
"""Train a multi-label convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`.
For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* Text classification: https://alpha.spacy.io/usage/text-classification
Developed for: spaCy 2.0.0a18
Last updated for: spaCy 2.0.0a18
"""
from __future__ import unicode_literals, print_function
import plac
import random
import tqdm
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
from pathlib import Path
import thinc.extra.datasets
import spacy.lang.en
import spacy
from spacy.gold import GoldParse, minibatch
from spacy.util import compounding
from spacy.pipeline import TextCategorizer
# TODO: Remove this once we're not supporting models trained with thinc <6.9.0
import thinc.neural._classes.layernorm
thinc.neural._classes.layernorm.set_compat_six_eight(False)
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=20):
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
def train_textcat(tokenizer, textcat,
train_texts, train_cats, dev_texts, dev_cats,
n_iter=20):
'''
Train the TextCategorizer without associated pipeline.
'''
textcat.begin_training()
optimizer = Adam(NumpyOps(), 0.001)
train_docs = [tokenizer(text) for text in train_texts]
# add the text classifier to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'textcat' not in nlp.pipe_names:
# textcat = nlp.create_pipe('textcat')
textcat = TextCategorizer(nlp.vocab, labels=['POSITIVE'])
nlp.add_pipe(textcat, last=True)
# otherwise, get it, so we can add labels to it
else:
textcat = nlp.get_pipe('textcat')
# add label to text classifier
# textcat.add_label('POSITIVE')
# load the IMBD dataset
print("Loading IMDB data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=2000)
train_docs = [nlp.tokenizer(text) for text in train_texts]
train_gold = [GoldParse(doc, cats=cats) for doc, cats in
zip(train_docs, train_cats)]
train_data = list(zip(train_docs, train_gold))
batch_sizes = compounding(4., 128., 1.001)
for i in range(n_iter):
losses = {}
# Progress bar and minibatching
batches = minibatch(tqdm.tqdm(train_data, leave=False), size=batch_sizes)
for batch in batches:
docs, golds = zip(*batch)
textcat.update(docs, golds, sgd=optimizer, drop=0.2,
losses=losses)
with textcat.model.use_params(optimizer.averages):
scores = evaluate(tokenizer, textcat, dev_texts, dev_cats)
yield losses['textcat'], scores
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training(lambda: [])
print("Training the model...")
print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
for i in range(n_iter):
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(train_data, size=compounding(4., 128., 1.001))
for batch in batches:
docs, golds = zip(*batch)
nlp.update(docs, golds, sgd=optimizer, drop=0.2, losses=losses)
with textcat.model.use_params(optimizer.averages):
# evaluate on the dev data split off in load_data()
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
print('{0:.3f}\t{0:.3f}\t{0:.3f}\t{0:.3f}' # print a simple table
.format(losses['textcat'], scores['textcat_p'],
scores['textcat_r'], scores['textcat_f']))
# test the trained model
test_text = "This movie sucked"
doc = nlp(test_text)
print(test_text, doc.cats)
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
print(test_text, doc2.cats)
def load_data(limit=0, split=0.8):
"""Load data from the IMDB dataset."""
# Partition off part of the train data for evaluation
train_data, _ = thinc.extra.datasets.imdb()
random.shuffle(train_data)
train_data = train_data[-limit:]
texts, labels = zip(*train_data)
cats = [{'POSITIVE': bool(y)} for y in labels]
split = int(len(train_data) * split)
return (texts[:split], cats[:split]), (texts[split:], cats[split:])
def evaluate(tokenizer, textcat, texts, cats):
docs = (tokenizer(text) for text in texts)
tp = 1e-8 # True positives
fp = 1e-8 # False positives
fn = 1e-8 # False negatives
tn = 1e-8 # True negatives
tp = 1e-8 # True positives
fp = 1e-8 # False positives
fn = 1e-8 # False negatives
tn = 1e-8 # True negatives
for i, doc in enumerate(textcat.pipe(docs)):
gold = cats[i]
for label, score in doc.cats.items():
@ -66,55 +127,10 @@ def evaluate(tokenizer, textcat, texts, cats):
tn += 1
elif score < 0.5 and gold[label] >= 0.5:
fn += 1
precis = tp / (tp + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fscore = 2 * (precis * recall) / (precis + recall)
return {'textcat_p': precis, 'textcat_r': recall, 'textcat_f': fscore}
def load_data(limit=0):
# Partition off part of the train data --- avoid running experiments
# against test.
train_data, _ = thinc.extra.datasets.imdb()
random.shuffle(train_data)
train_data = train_data[-limit:]
texts, labels = zip(*train_data)
cats = [{'POSITIVE': bool(y)} for y in labels]
split = int(len(train_data) * 0.8)
train_texts = texts[:split]
train_cats = cats[:split]
dev_texts = texts[split:]
dev_cats = cats[split:]
return (train_texts, train_cats), (dev_texts, dev_cats)
def main(model_loc=None):
nlp = spacy.lang.en.English()
tokenizer = nlp.tokenizer
textcat = TextCategorizer(tokenizer.vocab, labels=['POSITIVE'])
print("Load IMDB data")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=2000)
print("Itn.\tLoss\tP\tR\tF")
progress = '{i:d} {loss:.3f} {textcat_p:.3f} {textcat_r:.3f} {textcat_f:.3f}'
for i, (loss, scores) in enumerate(train_textcat(tokenizer, textcat,
train_texts, train_cats,
dev_texts, dev_cats, n_iter=20)):
print(progress.format(i=i, loss=loss, **scores))
# How to save, load and use
nlp.pipeline.append(textcat)
if model_loc is not None:
nlp.to_disk(model_loc)
nlp = spacy.load(model_loc)
doc = nlp(u'This movie sucked!')
print(doc.cats)
f_score = 2 * (precision * recall) / (precision + recall)
return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}
if __name__ == '__main__':

View File

@ -1,36 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
import plac
import codecs
import pathlib
import random
import twython
import spacy.en
import _handler
class Connection(twython.TwythonStreamer):
def __init__(self, keys_dir, nlp, query):
keys_dir = pathlib.Path(keys_dir)
read = lambda fn: (keys_dir / (fn + '.txt')).open().read().strip()
api_key = map(read, ['key', 'secret', 'token', 'token_secret'])
twython.TwythonStreamer.__init__(self, *api_key)
self.nlp = nlp
self.query = query
def on_success(self, data):
_handler.handle_tweet(self.nlp, data, self.query)
if random.random() >= 0.1:
reload(_handler)
def main(keys_dir, term):
nlp = spacy.en.English()
twitter = Connection(keys_dir, nlp, term)
twitter.statuses.filter(track=term, language='en')
if __name__ == '__main__':
plac.call(main)

View File

@ -1,16 +1,19 @@
'''Load vectors for a language trained using FastText
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
'''
"""
from __future__ import unicode_literals
import plac
import numpy
import spacy.language
import from spacy.language import Language
@plac.annotations(
vectors_loc=("Path to vectors", "positional", None, str))
def main(vectors_loc):
nlp = spacy.language.Language()
nlp = Language()
with open(vectors_loc, 'rb') as file_:
header = file_.readline()
@ -18,7 +21,7 @@ def main(vectors_loc):
nlp.vocab.clear_vectors(int(nr_dim))
for line in file_:
line = line.decode('utf8')
pieces = line.split()
pieces = line.split()
word = pieces[0]
vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
nlp.vocab.set_vector(word, vector)

View File

@ -94,7 +94,6 @@ def _zero_init(model):
@layerize
def _preprocess_doc(docs, drop=0.):
keys = [doc.to_array([LOWER]) for doc in docs]
keys = [a[:, 0] for a in keys]
ops = Model.ops
lengths = ops.asarray([arr.shape[0] for arr in keys])
keys = ops.xp.concatenate(keys)
@ -521,7 +520,6 @@ def zero_init(model):
@layerize
def preprocess_doc(docs, drop=0.):
keys = [doc.to_array([LOWER]) for doc in docs]
keys = [a[:, 0] for a in keys]
ops = Model.ops
lengths = ops.asarray([arr.shape[0] for arr in keys])
keys = ops.xp.concatenate(keys)

View File

@ -1,6 +1,7 @@
# coding: utf8
from __future__ import absolute_import, unicode_literals
from contextlib import contextmanager
import copy
from thinc.neural import Model
from thinc.neural.optimizers import Adam
@ -146,7 +147,7 @@ class Language(object):
@property
def meta(self):
self._meta.setdefault('lang', self.vocab.lang)
self._meta.setdefault('name', '')
self._meta.setdefault('name', 'model')
self._meta.setdefault('version', '0.0.0')
self._meta.setdefault('spacy_version', about.__version__)
self._meta.setdefault('description', '')
@ -330,6 +331,29 @@ class Language(object):
doc = proc(doc)
return doc
def disable_pipes(self, *names):
'''Disable one or more pipeline components.
If used as a context manager, the pipeline will be restored to the initial
state at the end of the block. Otherwise, a DisabledPipes object is
returned, that has a `.restore()` method you can use to undo your
changes.
EXAMPLE:
>>> nlp.add_pipe('parser')
>>> nlp.add_pipe('tagger')
>>> with nlp.disable_pipes('parser', 'tagger'):
>>> assert not nlp.has_pipe('parser')
>>> assert nlp.has_pipe('parser')
>>> disabled = nlp.disable_pipes('parser')
>>> assert len(disabled) == 1
>>> assert not nlp.has_pipe('parser')
>>> disabled.restore()
>>> assert nlp.has_pipe('parser')
'''
return DisabledPipes(self, *names)
def make_doc(self, text):
return self.tokenizer(text)
@ -657,6 +681,42 @@ class Language(object):
return self
class DisabledPipes(list):
'''Manager for temporary pipeline disabling.'''
def __init__(self, nlp, *names):
self.nlp = nlp
self.names = names
# Important! Not deep copy -- we just want the container (but we also
# want to support people providing arbitrarily typed nlp.pipeline
# objects.)
self.original_pipeline = copy.copy(nlp.pipeline)
list.__init__(self)
self.extend(nlp.remove_pipe(name) for name in names)
def __enter__(self):
return self
def __exit__(self, *args):
self.restore()
def restore(self):
'''Restore the pipeline to its state when DisabledPipes was created.'''
current, self.nlp.pipeline = self.nlp.pipeline, self.original_pipeline
unexpected = [name for name, pipe in current if not self.nlp.has_pipe(name)]
if unexpected:
# Don't change the pipeline if we're raising an error.
self.nlp.pipeline = current
msg = (
"Some current components would be lost when restoring "
"previous pipeline state. If you added components after "
"calling nlp.disable_pipes(), you should remove them "
"explicitly with nlp.remove_pipe() before the pipeline is "
"restore. Names of the new components: %s"
)
raise ValueError(msg % unexpected)
self[:] = []
def unpickle_language(vocab, meta, bytes_data):
lang = Language(vocab=vocab)
lang.from_bytes(bytes_data)

View File

@ -417,8 +417,6 @@ class Tagger(Pipe):
new_tag_map[tag] = orig_tag_map[tag]
else:
new_tag_map[tag] = {POS: X}
if 'SP' not in new_tag_map:
new_tag_map['SP'] = orig_tag_map.get('SP', {POS: X})
cdef Vocab vocab = self.vocab
if new_tag_map:
vocab.morphology = Morphology(vocab.strings, new_tag_map,

View File

@ -82,3 +82,21 @@ def test_remove_pipe(nlp, name):
assert not len(nlp.pipeline)
assert removed_name == name
assert removed_component == new_pipe
@pytest.mark.parametrize('name', ['my_component'])
def test_disable_pipes_method(nlp, name):
nlp.add_pipe(new_pipe, name=name)
assert nlp.has_pipe(name)
disabled = nlp.disable_pipes(name)
assert not nlp.has_pipe(name)
disabled.restore()
@pytest.mark.parametrize('name', ['my_component'])
def test_disable_pipes_context(nlp, name):
nlp.add_pipe(new_pipe, name=name)
assert nlp.has_pipe(name)
with nlp.disable_pipes(name):
assert not nlp.has_pipe(name)
assert nlp.has_pipe(name)

View File

@ -13,7 +13,9 @@ p
| that are part of an entity are set to the entity label, prefixed by the
| BILUO marker. For example #[code "B-ORG"] describes the first token of
| a multi-token #[code ORG] entity and #[code "U-PERSON"] a single
| token representing a #[code PERSON] entity
| token representing a #[code PERSON] entity. The
| #[+api("goldparse#biluo_tags_from_offsets") #[code biluo_tags_from_offsets]]
| function can help you convert entity offsets to the right format.
+code("Example structure").
[{

View File

@ -441,6 +441,37 @@ p
+cell tuple
+cell A #[code (name, component)] tuple of the removed component.
+h(2, "disable_pipes") Language.disable_pipes
+tag contextmanager
+tag-new(2)
p
| Disable one or more pipeline components. If used as a context manager,
| the pipeline will be restored to the initial state at the end of the
| block. Otherwise, a #[code DisabledPipes] object is returned, that has a
| #[code .restore()] method you can use to undo your changes.
+aside-code("Example").
with nlp.disable_pipes('tagger', 'parser'):
optimizer = nlp.begin_training(gold_tuples)
disabled = nlp.disable_pipes('tagger', 'parser')
optimizer = nlp.begin_training(gold_tuples)
disabled.restore()
+table(["Name", "Type", "Description"])
+row
+cell #[code *disabled]
+cell unicode
+cell Names of pipeline components to disable.
+row("foot")
+cell returns
+cell #[code DisabledPipes]
+cell
| The disabled pipes that can be restored by calling the object's
| #[code .restore()] method.
+h(2, "to_disk") Language.to_disk
+tag method
+tag-new(2)

View File

@ -304,6 +304,21 @@ p Modify the pipe's model, to use the given parameter values.
| The parameter values to use in the model. At the end of the
| context, the original parameters are restored.
+h(2, "add_label") #{CLASSNAME}.add_label
+tag method
p Add a new label to the pipe.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
#{VARNAME}.add_label('MY_LABEL')
+table(["Name", "Type", "Description"])
+row
+cell #[code label]
+cell unicode
+cell The label to add.
+h(2, "to_disk") #{CLASSNAME}.to_disk
+tag method

View File

@ -106,7 +106,7 @@
"How Pipelines Work": "pipelines",
"Custom Components": "custom-components",
"Developing Extensions": "extensions",
"Multi-threading": "multithreading",
"Multi-Threading": "multithreading",
"Serialization": "serialization"
}
},
@ -196,9 +196,10 @@
"teaser": "Full code examples you can modify and run.",
"next": "resources",
"menu": {
"Information Extraction": "information-extraction",
"Pipeline": "pipeline",
"Matching": "matching",
"Training": "training",
"Vectors & Similarity": "vectors",
"Deep Learning": "deep-learning"
}
}

View File

@ -38,3 +38,16 @@ p
| the generator in two, and then #[code izip] the extra stream to the
| document stream. Here's
| #[+a(gh("spacy") + "/issues/172#issuecomment-183963403") an example].
+h(3, "multi-processing-example") Example: Multi-processing with Joblib
p
| This example shows how to use multiple cores to process text using
| spaCy and #[+a("https://pythonhosted.org/joblib/") Joblib]. We're
| exporting part-of-speech-tagged, true-cased, (very roughly)
| sentence-separated text, with each "sentence" on a newline, and
| spaces between tokens. Data is loaded from the IMDB movie reviews
| dataset and will be loaded automatically via Thinc's built-in dataset
| loader.
+github("spacy", "examples/pipeline/multi_processing.py")

View File

@ -24,38 +24,108 @@ p
| #[strong experiment on your own data] to find a solution that works best
| for you.
+h(3, "example-new-entity-type") Example: Training an additional entity type
+h(3, "example-train-ner") Updating the Named Entity Recognizer
p
| This script shows how to add a new entity type to an existing pre-trained
| NER model. To keep the example short and simple, only a few sentences are
| This example shows how to update spaCy's entity recognizer
| with your own examples, starting off with an existing, pre-trained
| model, or from scratch using a blank #[code Language] class. To do
| this, you'll need #[strong example texts] and the
| #[strong character offsets] and #[strong labels] of each entity contained
| in the texts.
+github("spacy", "examples/training/train_ner.py")
+h(4) Step by step guide
+list("numbers")
+item
| #[strong Reformat the training data] to match spaCy's
| #[+a("/api/annotation#json-input") JSON format]. The built-in
| #[+api("goldparse#biluo_tags_from_offsets") #[code biluo_tags_from_offsets]]
| function can help you with this.
+item
| #[strong Load the model] you want to start with, or create an
| #[strong empty model] using
| #[+api("spacy#blank") #[code spacy.blank]] with the ID of your
| language. If you're using a blank model, don't forget to add the
| entity recognizer to the pipeline. If you're using an existing model,
| make sure to disable all other pipeline components during training
| using #[+api("language#disable_pipes") #[code nlp.disable_pipes]].
| This way, you'll only be training the entity recognizer.
+item
| #[strong Shuffle and loop over] the examples and create a
| #[code Doc] and #[code GoldParse] object for each example.
+item
| For each example, #[strong update the model]
| by calling #[+api("language#update") #[code nlp.update]], which steps
| through the words of the input. At each word, it makes a
| #[strong prediction]. It then consults the annotations provided on the
| #[code GoldParse] instance, to see whether it was
| right. If it was wrong, it adjusts its weights so that the correct
| action will score higher next time.
+item
| #[strong Save] the trained model using
| #[+api("language#to_disk") #[code nlp.to_disk]].
+item
| #[strong Test] the model to make sure the entities in the training
| data are recognised correctly.
+h(3, "example-new-entity-type") Training an additional entity type
p
| This script shows how to add a new entity type #[code ANIMAL] to an
| existing pre-trained NER model, or an empty #[code Language] class. To
| keep the example short and simple, only a few sentences are
| provided as examples. In practice, you'll need many more — a few hundred
| would be a good start. You will also likely need to mix in examples of
| other entity types, which might be obtained by running the entity
| recognizer over unlabelled sentences, and adding their annotations to the
| training set.
p
| The actual training is performed by looping over the examples, and
| calling #[+api("language#update") #[code nlp.update()]]. The
| #[code update] method steps through the words of the input. At each word,
| it makes a prediction. It then consults the annotations provided on the
| #[+api("goldparse") #[code GoldParse]] instance, to see whether it was
| right. If it was wrong, it adjusts its weights so that the correct
| action will score higher next time.
+github("spacy", "examples/training/train_new_entity_type.py")
+h(3, "example-ner-from-scratch") Example: Training an NER system from scratch
+h(4) Step by step guide
p
| This example is written to be self-contained and reasonably transparent.
| To achieve that, it duplicates some of spaCy's internal functionality.
| Specifically, in this example, we don't use spaCy's built-in
| #[+api("language") #[code Language]] class to wire together the
| #[+api("vocab") #[code Vocab]], #[+api("tokenizer") #[code Tokenizer]]
| and #[+api("entityrecognizer") #[code EntityRecognizer]]. Instead, we
| write our own simle #[code Pipeline] class, so that it's easier to see
| how the pieces interact.
+list("numbers")
+item
| Create #[code Doc] and #[code GoldParse] objects for
| #[strong each example in your training data].
+github("spacy", "examples/training/train_ner_standalone.py")
+item
| #[strong Load the model] you want to start with, or create an
| #[strong empty model] using
| #[+api("spacy#blank") #[code spacy.blank]] with the ID of your
| language. If you're using a blank model, don't forget to add the
| entity recognizer to the pipeline. If you're using an existing model,
| make sure to disable all other pipeline components during training
| using #[+api("language#disable_pipes") #[code nlp.disable_pipes]].
| This way, you'll only be training the entity recognizer.
+item
| #[strong Add the new entity label] to the entity recognizer using the
| #[+api("entityrecognizer#add_label") #[code add_label]] method. You
| can access the entity recognizer in the pipeline via
| #[code nlp.get_pipe('ner')].
+item
| #[strong Loop over] the examples and call
| #[+api("language#update") #[code nlp.update]], which steps through
| the words of the input. At each word, it makes a
| #[strong prediction]. It then consults the annotations provided on the
| #[code GoldParse] instance, to see whether it was right. If it was
| wrong, it adjusts its weights so that the correct action will score
| higher next time.
+item
| #[strong Save] the trained model using
| #[+api("language#to_disk") #[code nlp.to_disk]].
+item
| #[strong Test] the model to make sure the new entity is recognised
| correctly.

View File

@ -1,6 +1,195 @@
//- 💫 DOCS > USAGE > TRAINING > TAGGER & PARSER
+under-construction
+h(3, "example-train-parser") Updating the Dependency Parser
p
| This example shows how to train spaCy's dependency parser, starting off
| with an existing model or a blank model. You'll need a set of
| #[strong training examples] and the respective #[strong heads] and
| #[strong dependency label] for each token of the example texts.
+github("spacy", "examples/training/train_parser.py")
+h(4) Step by step guide
+list("numbers")
+item
| #[strong Load the model] you want to start with, or create an
| #[strong empty model] using
| #[+api("spacy#blank") #[code spacy.blank]] with the ID of your
| language. If you're using a blank model, don't forget to add the
| parser to the pipeline. If you're using an existing model,
| make sure to disable all other pipeline components during training
| using #[+api("language#disable_pipes") #[code nlp.disable_pipes]].
| This way, you'll only be training the parser.
+item
| #[strong Add the dependency labels] to the parser using the
| #[+api("dependencyparser#add_label") #[code add_label]] method. If
| you're starting off with a pre-trained spaCy model, this is usually
| not necessary but it doesn't hurt either, just to be safe.
+item
| #[strong Shuffle and loop over] the examples and create a
| #[code Doc] and #[code GoldParse] object for each example. Make sure
| to pass in the #[code heads] and #[code deps] when you create the
| #[code GoldParse].
+item
| For each example, #[strong update the model]
| by calling #[+api("language#update") #[code nlp.update]], which steps
| through the words of the input. At each word, it makes a
| #[strong prediction]. It then consults the annotations provided on the
| #[code GoldParse] instance, to see whether it was
| right. If it was wrong, it adjusts its weights so that the correct
| action will score higher next time.
+item
| #[strong Save] the trained model using
| #[+api("language#to_disk") #[code nlp.to_disk]].
+item
| #[strong Test] the model to make sure the parser works as expected.
+h(3, "example-train-tagger") Updating the Part-of-speech Tagger
p
| In this example, we're training spaCy's part-of-speech tagger with a
| custom tag map. We start off with a blank #[code Language] class, update
| its defaults with our custom tags and then train the tagger. You'll need
| a set of #[strong training examples] and the respective
| #[strong custom tags], as well as a dictionary mapping those tags to the
| #[+a("http://universaldependencies.github.io/docs/u/pos/index.html") Universal Dependencies scheme].
+github("spacy", "examples/training/train_tagger.py")
+h(4) Step by step guide
+list("numbers")
+item
| #[strong Create] a new #[code Language] class and before initialising
| it, update the #[code tag_map] in its #[code Defaults] with your
| custom tags.
+item
| #[strong Create a new tagger] component and add it to the pipeline.
+item
| #[strong Shuffle and loop over] the examples and create a
| #[code Doc] and #[code GoldParse] object for each example. Make sure
| to pass in the #[code tags] when you create the #[code GoldParse].
+item
| For each example, #[strong update the model]
| by calling #[+api("language#update") #[code nlp.update]], which steps
| through the words of the input. At each word, it makes a
| #[strong prediction]. It then consults the annotations provided on the
| #[code GoldParse] instance, to see whether it was
| right. If it was wrong, it adjusts its weights so that the correct
| action will score higher next time.
+item
| #[strong Save] the trained model using
| #[+api("language#to_disk") #[code nlp.to_disk]].
+item
| #[strong Test] the model to make sure the parser works as expected.
+h(3, "intent-parser") Training a parser for custom semantics
p
| spaCy's parser component can be used to trained to predict any type
| of tree structure over your input text  including
| #[strong semantic relations] that are not syntactic dependencies. This
| can be useful to for #[strong conversational applications], which need to
| predict trees over whole documents or chat logs, with connections between
| the sentence roots used to annotate discourse structure. For example, you
| can train spaCy's parser to label intents and their targets, like
| attributes, quality, time and locations. The result could look like this:
+codepen("991f245ef90debb78c8fc369294f75ad", 300)
+code.
doc = nlp(u"find a hotel with good wifi")
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
# [('find', 'ROOT', 'find'), ('hotel', 'PLACE', 'find'),
# ('good', 'QUALITY', 'wifi'), ('wifi', 'ATTRIBUTE', 'hotel')]
p
| The above tree attaches "wifi" to "hotel" and assigns the dependency
| label #[code ATTRIBUTE]. This may not be a correct syntactic dependency
| but in this case, it expresses exactly what we need: the user is looking
| for a hotel with the attribute "wifi" of the quality "good". This query
| can then be processed by your application and used to trigger the
| respective action e.g. search the database for hotels with high ratings
| for their wifi offerings.
+aside("Tip: merge phrases and entities")
| To achieve even better accuracy, try merging multi-word tokens and
| entities specific to your domain into one token before parsing your text.
| You can do this by running the entity recognizer or
| #[+a("/usage/linguistic-features#rule-based-matching") rule-based matcher]
| to find relevant spans, and merging them using
| #[+api("span#merge") #[code Span.merge]]. You could even add your own
| custom #[+a("/usage/processing-pipelines#custom-components") pipeline component]
| to do this automatically just make sure to add it #[code before='parser'].
p
| The following example example shows a full implementation of a training
| loop for a custom message parser for a common "chat intent": finding
| local businesses. Our message semantics will have the following types
| of relations: #[code ROOT], #[code PLACE], #[code QUALITY],
| #[code ATTRIBUTE], #[code TIME] and #[code LOCATION].
+github("spacy", "examples/training/train_intent_parser.py")
+h(4) Step by step guide
+list("numbers")
+item
| #[strong Create the training data] consisting of words, their heads
| and their dependency labels in order. A token's head is the index
| of the token it is attached to. The heads don't need to be
| syntactically correct they should express the
| #[strong semantic relations] you want the parser to learn. For words
| that shouldn't receive a label, you can choose an arbitrary
| placeholder, for example #[code -].
+item
| #[strong Load the model] you want to start with, or create an
| #[strong empty model] using
| #[+api("spacy#blank") #[code spacy.blank]] with the ID of your
| language. If you're using a blank model, don't forget to add the
| parser to the pipeline. If you're using an existing model,
| make sure to disable all other pipeline components during training
| using #[+api("language#disable_pipes") #[code nlp.disable_pipes]].
| This way, you'll only be training the parser.
+item
| #[strong Add the dependency labels] to the parser using the
| #[+api("dependencyparser#add_label") #[code add_label]] method.
+item
| #[strong Shuffle and loop over] the examples and create a
| #[code Doc] and #[code GoldParse] object for each example. Make sure
| to pass in the #[code heads] and #[code deps] when you create the
| #[code GoldParse].
+item
| For each example, #[strong update the model]
| by calling #[+api("language#update") #[code nlp.update]], which steps
| through the words of the input. At each word, it makes a
| #[strong prediction]. It then consults the annotations provided on the
| #[code GoldParse] instance, to see whether it was
| right. If it was wrong, it adjusts its weights so that the correct
| action will score higher next time.
+item
| #[strong Save] the trained model using
| #[+api("language#to_disk") #[code nlp.to_disk]].
+item
| #[strong Test] the model to make sure the parser works as expected.
+h(3, "training-json") JSON format for training

View File

@ -1,13 +1,62 @@
//- 💫 DOCS > USAGE > TRAINING > TEXT CLASSIFICATION
+under-construction
+h(3, "example-textcat") Example: Training spaCy's text classifier
+h(3, "example-textcat") Adding a text classifier to a spaCy model
+tag-new(2)
p
| This example shows how to use and train spaCy's new
| #[+api("textcategorizer") #[code TextCategorizer]] pipeline component
| on IMDB movie reviews.
| This example shows how to train a multi-label convolutional neural
| network text classifier on IMDB movie reviews, using spaCy's new
| #[+api("textcategorizer") #[code TextCategorizer]] component. The
| dataset will be loaded automatically via Thinc's built-in dataset
| loader. Predictions are available via
| #[+api("doc#attributes") #[code Doc.cats]].
+github("spacy", "examples/training/train_textcat.py")
+h(4) Step by step guide
+list("numbers")
+item
| #[strong Load the model] you want to start with, or create an
| #[strong empty model] using
| #[+api("spacy#blank") #[code spacy.blank]] with the ID of your
| language. If you're using an existing model, make sure to disable all
| other pipeline components during training using
| #[+api("language#disable_pipes") #[code nlp.disable_pipes]]. This
| way, you'll only be training the text classifier.
+item
| #[strong Add the text classifier] to the pipeline, and add the labels
| you want to train for example, #[code POSITIVE].
+item
| #[strong Load and pre-process the dataset], shuffle the data and
| split off a part of it to hold back for evaluation. This way, you'll
| be able to see results on each training iteration.
+item
| #[strong Loop over] the training examples, partition them into
| batches and create #[code Doc] and #[code GoldParse] objects for each
| example in the batch.
+item
| #[strong Update the model] by calling
| #[+api("language#update") #[code nlp.update]], which steps
| through the examples and makes a #[strong prediction]. It then
| consults the annotations provided on the #[code GoldParse] instance,
| to see whether it was right. If it was wrong, it adjusts its weights
| so that the correct prediction will score higher next time.
+item
| Optionally, you can also #[strong evaluate the text classifier] on
| each iteration, by checking how it performs on the development data
| held back from the dataset. This lets you print the
| #[strong precision], #[strong recall] and #[strong F-score].
+item
| #[strong Save] the trained model using
| #[+api("language#to_disk") #[code nlp.to_disk]].
+item
| #[strong Test] the model to make sure the text classifier works as
| expected.

View File

@ -2,6 +2,37 @@
include ../_includes/_mixins
+section("information-extraction")
+h(3, "phrase-matcher") Using spaCy's phrase matcher
+tag-new(2)
p
| This example shows how to use the new
| #[+api("phrasematcher") #[code PhraseMatcher]] to efficiently find
| entities from a large terminology list.
+github("spacy", "examples/information_extraction/phrase_matcher.py")
+h(3, "entity-relations") Extracting entity relations
p
| A simple example of extracting relations between phrases and
| entities using spaCy's named entity recognizer and the dependency
| parse. Here, we extract money and currency values (entities labelled
| as #[code MONEY]) and then check the dependency tree to find the
| noun phrase they are referring to for example: "$9.4 million"
| &rarr; "Net income".
+github("spacy", "examples/information_extraction/entity_relations.py")
+h(3, "subtrees") Navigating the parse tree and subtrees
p
| This example shows how to navigate the parse tree including subtrees
| attached to a word.
+github("spacy", "examples/information_extraction/parse_subtrees.py")
+section("pipeline")
+h(3, "custom-components-entities") Custom pipeline components and attribute extensions
+tag-new(2)
@ -40,27 +71,29 @@ include ../_includes/_mixins
+github("spacy", "examples/pipeline/custom_attr_methods.py")
+section("matching")
+h(3, "matcher") Using spaCy's rule-based matcher
+h(3, "multi-processing") Multi-processing with Joblib
p
| This example shows how to use spaCy's rule-based
| #[+api("matcher") #[code Matcher]] to find and label entities across
| documents.
| This example shows how to use multiple cores to process text using
| spaCy and #[+a("https://pythonhosted.org/joblib/") Joblib]. We're
| exporting part-of-speech-tagged, true-cased, (very roughly)
| sentence-separated text, with each "sentence" on a newline, and
| spaces between tokens. Data is loaded from the IMDB movie reviews
| dataset and will be loaded automatically via Thinc's built-in dataset
| loader.
+github("spacy", "examples/matcher_example.py")
+h(3, "phrase-matcher") Using spaCy's phrase matcher
+tag-new(2)
p
| This example shows how to use the new
| #[+api("phrasematcher") #[code PhraseMatcher]] to efficiently find
| entities from a large terminology list.
+github("spacy", "examples/phrase_matcher.py")
+github("spacy", "examples/pipeline/multi_processing.py")
+section("training")
+h(3, "training-ner") Training spaCy's Named Entity Recognizer
p
| This example shows how to update spaCy's entity recognizer
| with your own examples, starting off with an existing, pre-trained
| model, or from scratch using a blank #[code Language] class.
+github("spacy", "examples/training/train_ner.py")
+h(3, "new-entity-type") Training an additional entity type
p
@ -71,25 +104,63 @@ include ../_includes/_mixins
+github("spacy", "examples/training/train_new_entity_type.py")
+h(3, "ner-standalone") Training an NER system from scratch
+h(3, "parser") Training spaCy's Dependency Parser
p
| This example is written to be self-contained and reasonably
| transparent. To achieve that, it duplicates some of spaCy's internal
| functionality.
| This example shows how to update spaCy's dependency parser,
| starting off with an existing, pre-trained model, or from scratch
| using a blank #[code Language] class.
+github("spacy", "examples/training/train_ner_standalone.py")
+github("spacy", "examples/training/train_parser.py")
+h(3, "tagger") Training spaCy's Part-of-speech Tagger
p
| In this example, we're training spaCy's part-of-speech tagger with a
| custom tag map, mapping our own tags to the mapping those tags to the
| #[+a("http://universaldependencies.github.io/docs/u/pos/index.html") Universal Dependencies scheme].
+github("spacy", "examples/training/train_tagger.py")
+h(3, "intent-parser") Training a custom parser for chat intent semantics
p
| spaCy's parser component can be used to trained to predict any type
| of tree structure over your input text. You can also predict trees
| over whole documents or chat logs, with connections between the
| sentence-roots used to annotate discourse structure. In this example,
| we'll build a message parser for a common "chat intent": finding
| local businesses. Our message semantics will have the following types
| of relations: #[code ROOT], #[code PLACE], #[code QUALITY],
| #[code ATTRIBUTE], #[code TIME] and #[code LOCATION].
+github("spacy", "examples/training/train_intent_parser.py")
+h(3, "textcat") Training spaCy's text classifier
+tag-new(2)
p
| This example shows how to use and train spaCy's new
| #[+api("textcategorizer") #[code TextCategorizer]] pipeline component
| on IMDB movie reviews.
| This example shows how to train a multi-label convolutional neural
| network text classifier on IMDB movie reviews, using spaCy's new
| #[+api("textcategorizer") #[code TextCategorizer]] component. The
| dataset will be loaded automatically via Thinc's built-in dataset
| loader. Predictions are available via
| #[+api("doc#attributes") #[code Doc.cats]].
+github("spacy", "examples/training/train_textcat.py")
+section("vectors")
+h(3, "fasttext") Loading pre-trained fastText vectors
p
| This simple snippet is all you need to be able to use the Facebook's
| #[+a("https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md") fastText vectors]
| (294 languages, pre-trained on Wikipedia) with spaCy. Once they're
| loaded, the vectors will be available via spaCy's built-in
| #[code similarity()] methods.
+github("spacy", "examples/vectors_fast_text.py")
+section("deep-learning")
+h(3, "keras") Text classification with Keras

View File

@ -2,8 +2,4 @@
include ../_includes/_mixins
+under-construction
+h(2, "example") Example
+github("spacy", "examples/training/train_textcat.py")
include _training/_textcat