spaCy/website/docs/_api-language.jade

//- ----------------------------------
//- 💫 DOCS > API > LANGUAGE
//- ----------------------------------
+section("language")
+h(2, "language", "https://github.com/" + SOCIAL.github + "/spaCy/blob/master/spacy/language.py")
| #[+tag class] Language
p.
A pipeline that transforms text strings into annotated spaCy Doc objects. Usually you'll load the Language pipeline once and pass the instance around your program.
+code("python", "Overview").
class Language:
    Defaults = BaseDefaults

    def __init__(self, path=True, **overrides):
        self.vocab = Vocab()
        self.tokenizer = Tokenizer()
        self.tagger = Tagger()
        self.parser = DependencyParser()
        self.entity = EntityRecognizer()
        self.make_doc = lambda text: Doc()
        self.pipeline = [self.tagger, self.parser, self.entity]

    def __call__(self, text, **toggle):
        doc = self.make_doc(text)
        for process in self.pipeline:
            if toggle.get(process.name, True):
                process(doc)
        return doc

    def pipe(self, texts_iterator, batch_size=1000, n_threads=2, **toggle):
        docs = (self.make_doc(text) for text in texts_iterator)
        for process in self.pipeline:
            if toggle.get(process.name, True):
                docs = process.pipe(docs, batch_size=batch_size, n_threads=n_threads)
        for doc in docs:
            yield doc

    def end_training(self, path=None):
        return None

class English(Language):
    class Defaults(BaseDefaults):
        pass

class German(Language):
    class Defaults(BaseDefaults):
        pass
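p.
    For orientation, a minimal usage sketch (assuming the English model data is
    installed): load the pipeline once, then re-use the same instance for every
    document you process.

+code("python", "Minimal usage (illustrative sketch)").
    from spacy.en import English

    nlp = English()               # slow: loads the models, so do this once per process
    doc = nlp(u'Some text to annotate.')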
+section("english-init")
+h(3, "english-init")
| #[+tag method] Language.__init__
p
| Load the pipeline. You can disable components by passing #[code None] as
| a value, e.g. pass #[code parser=None, vectors=None] to save memory if
| you're not using those components. To supply a custom component, pass an
| object as the value. Pass a function as #[code create_pipeline] to use a
| custom pipeline --- see the custom pipeline tutorial.
+aside("Efficiency").
Loading takes 10-20 seconds, and the instance consumes 2 to 3
gigabytes of memory. Intended use is for one instance to be
created for each language per process, but you can create more
if you're doing something unusual. You may wish to make the
instance a global variable or "singleton".
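p.
    For example, a lazily-loaded module-level instance keeps loading to once per
    process. This helper is only an illustration, not part of spaCy's API.

+code("python", "One instance per process (illustrative sketch)").
    _NLP = None

    def get_nlp():
        # Load the pipeline on first use, then re-use the cached instance.
        global _NLP
        if _NLP is None:
            _NLP = English()
        return _NLP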
+table(["Example", "Description"])
2016-03-31 17:24:48 +03:00
+row
2016-10-21 01:58:24 +03:00
+cell #[code nlp = English()]
+cell Load everything, from default path.
2016-03-31 17:24:48 +03:00
+row
2016-10-21 01:58:24 +03:00
+cell #[code nlp = English(path='my_data')]
+cell Load everything, from specified path
2016-03-31 17:24:48 +03:00
+row
2016-10-21 01:58:24 +03:00
+cell #[code nlp = English(path=path_obj)]
+cell Load everything, from an object that follows the #[code pathlib.Path] protocol.
2016-10-03 21:19:13 +03:00
2016-03-31 17:24:48 +03:00
+row
2016-10-21 01:58:24 +03:00
+cell #[code nlp = English(parser=False, vectors=False)]
+cell Load everything except the parser and the word vectors.
2016-03-31 17:24:48 +03:00
+row
2016-10-21 01:58:24 +03:00
+cell #[code nlp = English(parser=my_parser)]
+cell Load everything, and use a custom parser.
2016-03-31 17:24:48 +03:00
+row
2016-10-21 01:58:24 +03:00
+cell #[code nlp = English(create_pipeline=my_pipeline)]
+cell Load everything, and use a custom pipeline.
+code("python", "Definition").
def __init__(self, path=True, **overrides):
    D = self.Defaults
    self.vocab = Vocab(path=path, parent=self, **D.vocab) \
        if 'vocab' not in overrides \
        else overrides['vocab']
    self.tokenizer = Tokenizer(self.vocab, path=path, **D.tokenizer) \
        if 'tokenizer' not in overrides \
        else overrides['tokenizer']
    self.tagger = Tagger(self.vocab, path=path, **D.tagger) \
        if 'tagger' not in overrides \
        else overrides['tagger']
    self.parser = DependencyParser(self.vocab, path=path, **D.parser) \
        if 'parser' not in overrides \
        else overrides['parser']
    self.entity = EntityRecognizer(self.vocab, path=path, **D.entity) \
        if 'entity' not in overrides \
        else overrides['entity']
    self.matcher = Matcher(self.vocab, path=path, **D.matcher) \
        if 'matcher' not in overrides \
        else overrides['matcher']

    if 'make_doc' in overrides:
        self.make_doc = overrides['make_doc']
    elif 'create_make_doc' in overrides:
        self.make_doc = overrides['create_make_doc'](self)
    else:
        self.make_doc = lambda text: self.tokenizer(text)

    if 'pipeline' in overrides:
        self.pipeline = overrides['pipeline']
    elif 'create_pipeline' in overrides:
        self.pipeline = overrides['create_pipeline'](self)
    else:
        self.pipeline = [self.tagger, self.parser, self.matcher, self.entity]
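p.
    For illustration, a sketch of the #[code create_pipeline] hook: the function
    receives the #[code Language] instance and returns the list of processes to
    apply, so you can re-order or drop components. The name #[code my_pipeline]
    is hypothetical.

+code("python", "create_pipeline (illustrative sketch)").
    def my_pipeline(nlp):
        # Run only the tagger and entity recognizer, skipping the parser.
        return [nlp.tagger, nlp.entity]

    nlp = English(create_pipeline=my_pipeline)
    doc = nlp(u'Some text.')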
+section("language-call")
+h(3, "language-call")
| #[+tag method] Language.__call__
p
| The main entry point to spaCy. Takes raw unicode text, and returns
| a #[code Doc] object, which can be iterated to access #[code Token]
| and #[code Span] objects.
+aside("Efficiency").
spaCy's algorithms are all linear-time, so you can supply
documents of arbitrary length, e.g. whole novels.
+table(["Example", "Description"], "code")
2016-03-31 17:24:48 +03:00
+row
2016-10-21 01:58:24 +03:00
+cell #[ doc = nlp(u'Some text.')]
2016-03-31 17:24:48 +03:00
+cell Apply the full pipeline.
+row
2016-10-21 01:58:24 +03:00
+cell #[ doc = nlp(u'Some text.', parse=False)]
2016-03-31 17:24:48 +03:00
+cell Applies tagger and entity, not parser
+row
2016-10-21 01:58:24 +03:00
+cell #[ doc = nlp(u'Some text.', entity=False)]
2016-03-31 17:24:48 +03:00
+cell Applies tagger and parser, not entity.
+row
2016-10-21 01:58:24 +03:00
+cell #[ doc = nlp(u'Some text.', tag=False)]
2016-03-31 17:24:48 +03:00
+cell Does not apply tagger, entity or parser
+row
2016-10-21 01:58:24 +03:00
+cell #[ doc = nlp(u'')]
2016-03-31 17:24:48 +03:00
+cell Zero-length tokens, not an error
+row
2016-10-21 01:58:24 +03:00
+cell #[ doc = nlp(b'Some text')]
2016-03-31 17:24:48 +03:00
+cell Error: need unicode
+row
2016-10-21 01:58:24 +03:00
+cell #[ doc = nlp(b'Some text'.decode('utf8'))]
2016-03-31 17:24:48 +03:00
+cell Decode bytes into unicode first.
+code("python", "Definition").
def __call__(self, text, tag=True, parse=True, entity=True, matcher=True):
    return Doc()
+table(["Name", "Type", "Description"])
2016-03-31 17:24:48 +03:00
+row
+cell text
2016-10-03 21:19:13 +03:00
+cell #[+a(link_unicode) unicode]
2016-03-31 17:24:48 +03:00
+cell.
2016-10-03 21:19:13 +03:00
The text to be processed. spaCy expects raw unicode text
you don"t necessarily need to, say, split it into paragraphs.
However, depending on your documents, you might be better
off applying custom pre-processing. Non-text formatting,
e.g. from HTML mark-up, should be removed before sending
the document to spaCy. If your documents have a consistent
format, you may be able to improve accuracy by pre-processing.
For instance, if the first word of your documents are always
in upper-case, it may be helpful to normalize them before
2016-03-31 17:24:48 +03:00
supplying them to spaCy.
+row
+cell tag
2016-10-03 21:19:13 +03:00
+cell #[+a(link_bool) bool]
2016-03-31 17:24:48 +03:00
+cell.
2016-10-03 21:19:13 +03:00
Whether to apply the part-of-speech tagger. Required for
2016-03-31 17:24:48 +03:00
parsing and entity recognition.
+row
+cell parse
2016-10-03 21:19:13 +03:00
+cell #[+a(link_bool) bool]
2016-03-31 17:24:48 +03:00
+cell.
Whether to apply the syntactic dependency parser.
+row
+cell entity
2016-10-03 21:19:13 +03:00
+cell #[+a(link_bool) bool]
2016-03-31 17:24:48 +03:00
+cell.
Whether to apply the named entity recognizer.
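p.
    A minimal sketch of calling the pipeline, assuming #[code nlp] is a loaded
    #[code English] instance: the returned #[code Doc] can be iterated for
    #[code Token] objects and sliced into #[code Span] objects.

+code("python", "Usage (illustrative sketch)").
    doc = nlp(u'London is a big city in the United Kingdom.')
    tags = [(token.orth_, token.tag_) for token in doc]   # each item is a Token
    span = doc[0:3]                                       # slicing a Doc yields a Span

    # Skip the parser if you only need tags and entities.
    doc = nlp(u'Some text.', parse=False)

    # Bytes must be decoded into unicode before calling the pipeline.
    doc = nlp(b'Some text'.decode('utf8'))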
+section("english-pipe")
+h(3, "english-pipe")
| #[+tag method] English.pipe
p
| Parse a sequence of texts into a sequence of #[code Doc] objects.
| Accepts a generator as input, and produces a generator as output.
| Internally, it accumulates a buffer of #[code batch_size]
| texts, works on them with #[code n_threads] workers in parallel,
| and then yields the #[code Doc] objects one by one.
+aside("Efficiency").
spaCy releases the global interpreter lock around the parser and
named entity recognizer, allowing shared-memory parallelism via
OpenMP. However, OpenMP is not supported on OSX — so multiple
threads will only be used on Linux and Windows.
+table(["Example", "Description"], "usage")
+row
+cell #[+a("https://github.com/" + SOCIAL.github + "/spaCy/blob/master/examples/parallel_parse.py") parallel_parse.py]
+cell Parse comments from Reddit in parallel.
+code("python", "Definition").
def pipe(self, texts, n_threads=2, batch_size=1000):
    yield Doc()
+table(["Arg", "Type", "Description"])
2016-03-31 17:24:48 +03:00
+row
+cell texts
+cell
+cell.
2016-10-03 21:19:13 +03:00
A sequence of unicode objects. Usually you will want this
to be a generator, so that you don"t need to have all of
2016-03-31 17:24:48 +03:00
your texts in memory.
+row
+cell n_threads
2016-10-03 21:19:13 +03:00
+cell #[+a(link_int) int]
2016-03-31 17:24:48 +03:00
+cell.
2016-10-03 21:19:13 +03:00
The number of worker threads to use. If -1, OpenMP will
2016-03-31 17:24:48 +03:00
decide how many to use at run time. Default is 2.
+row
+cell batch_size
2016-10-03 21:19:13 +03:00
+cell #[+a(link_int) int]
2016-03-31 17:24:48 +03:00
+cell.
2016-10-03 21:19:13 +03:00
The number of texts to buffer. Let"s say you have a
#[code batch_size] of 1,000. The input, #[code texts], is
a generator that yields the texts one-by-one. We want to
operate on them in parallel. So, we accumulate a work queue.
Instead of taking one document from #[code texts] and
operating on it, we buffer #[code batch_size] documents,
work on them in parallel, and then yield them one-by-one.
2016-03-31 17:24:48 +03:00
Higher #[code batch_size] therefore often results in better
parallelism, up to a point.
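p.
    A minimal sketch of streaming texts through #[code .pipe()], assuming
    #[code nlp] is a loaded #[code English] instance. The #[code read_texts]
    generator is hypothetical; the point is that texts are yielded one at a
    time, so the whole corpus never has to sit in memory.

+code("python", "Usage (illustrative sketch)").
    import io

    def read_texts(path):
        # Yield one unicode string per line of the file.
        with io.open(path, encoding='utf8') as file_:
            for line in file_:
                yield line.strip()

    for doc in nlp.pipe(read_texts('my_corpus.txt'), batch_size=1000, n_threads=4):
        print(len(doc))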