================
spacy.en.English
================

99% of the time, you will load spaCy's resources using a language pipeline class,
e.g. `spacy.en.English`. The pipeline class reads the data from disk, from a
specified directory. By default, spaCy installs data into each language's
package directory, and loads it from there.

Usually, this is all you will need:

>>> from spacy.en import English
>>> nlp = English()

If you need to replace some of the components, you may want to just make your
own pipeline class --- the English class itself does almost no work; it just
applies the modules in order. You can also provide a function or class that
produces a tokenizer, tagger, parser or entity recognizer to :code:`English.__init__`,
to customize the pipeline:

>>> from spacy.en import English
>>> from my_module import MyTagger
>>> nlp = English(Tagger=MyTagger)
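
What the replacement has to provide depends on how :code:`English.__init__`
constructs and applies it. As a rough, hypothetical sketch (the constructor
arguments below are assumptions, not spaCy's actual factory signature), a
drop-in tagger is just an object the pipeline can build and then call on each
document::

    # my_module.py -- a hypothetical stand-in tagger.
    class MyTagger(object):
        def __init__(self, vocab, data_dir):
            # Keep whatever the pipeline hands to the factory; here we only
            # hold on to the vocabulary and ignore the data directory.
            self.vocab = vocab

        def __call__(self, tokens):
            # Called once per document. A real tagger would run its model
            # here and write part-of-speech predictions back to the tokens.
            pass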

The text processing API is very small and simple. Everything is a callable object,
and you will almost always apply the pipeline all at once.

.. py:class:: spacy.en.English

  .. py:method:: __init__(self, data_dir=..., Tokenizer=..., Tagger=..., Parser=..., Entity=..., Matcher=..., Packer=None, load_vectors=True)

    :param unicode data_dir:
      The data directory. May be None, to disable any data loading (including
      the vocabulary).

    :param Tokenizer:
      A class/function that creates the tokenizer.

    :param Tagger:
      A class/function that creates the part-of-speech tagger.

    :param Parser:
      A class/function that creates the dependency parser.

    :param Entity:
      A class/function that creates the named entity recognizer.

    :param bool load_vectors:
      A boolean value to control whether the word vectors are loaded.
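
    For example, using only the arguments documented above, you might skip
    loading the word vectors, or construct an instance with no data at all:

      >>> from spacy.en import English
      >>> nlp = English(load_vectors=False)  # Everything except the word vectors
      >>> nlp = English(data_dir=None)       # No data loading, not even the vocabulary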

  .. py:method:: __call__(text, tag=True, parse=True, entity=True) -> Doc

    :param unicode text:
      The text to be processed. No pre-processing needs to be applied, and any
      length of text can be submitted. Usually you will submit a whole document.
      Text may be zero-length. An exception is raised if byte strings are supplied.

    :param bool tag:
      Whether to apply the part-of-speech tagger. Required for parsing and entity
      recognition.

    :param bool parse:
      Whether to apply the syntactic dependency parser.

    :param bool entity:
      Whether to apply the named entity recognizer.

    :return: A document
    :rtype: :py:class:`spacy.tokens.Doc`

    :Example:

      >>> from spacy.en import English
      >>> nlp = English()
      >>> doc = nlp(u'Some text.')                # Applies tagger, parser, entity
      >>> doc = nlp(u'Some text.', parse=False)   # Applies tagger and entity, not parser
      >>> doc = nlp(u'Some text.', entity=False)  # Applies tagger and parser, not entity
      >>> doc = nlp(u'Some text.', tag=False)     # Does not apply tagger, entity or parser
      >>> doc = nlp(u'')                          # Zero-length text is not an error
      >>> doc = nlp(b'Some text')                 # Error: need unicode
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "spacy/en/__init__.py", line 128, in __call__
          tokens = self.tokenizer(text)
      TypeError: Argument 'string' has incorrect type (expected unicode, got str)
      >>> doc = nlp(b'Some text'.decode('utf8'))  # Decode to unicode first