<!-- TODO-->
<!-- Doc-->
<!-- to_array-->
<!-- count_by-->
<!-- from_array-->
<!-- from_bytes-->
<!-- to_bytes-->
<!-- read_bytes-->
<!-- -->
<!-- Token-->
<!-- Constructors-->
<!-- Examples for repvec. Rename?-->
<!-- Link Simple Good Turing in prob-->
<!---->
<!-- Span-->
<!-- Constructors-->
<!-- Convert details to Define lists-->
<!-- Styling of elements in Parse. Improve Span.root documentation-->
<!-- -->
<!-- Lexeme-->
<!-- Constructors-->
<!---->
<!-- Vocab-->
<!-- Constructors-->
<!---->
<!-- StringStore-->
<!-- Constructors-->
<details>
<summary><a name="pipeline"><span class="declaration"><span class="label">class</span><code>English</code></span></a></summary>
<p>Load models into a callable object to process English text. Intended use is for one instance to be created per process. You can create more if you're doing something unusual. You may wish to make the instance a global variable or "singleton". We usually instantiate the object in the <code>main()</code> function and pass it around as an explicit argument.</p>
<pre class="language-python"><code>from spacy.en import English
from spacy._doc_examples import download_war_and_peace

unprocessed_unicode = download_war_and_peace()

nlp = English()
doc = nlp(unprocessed_unicode)</code></pre>
<details open="open">
<summary><a><span class="declaration"><code>__init__</code><span class="parameters">self, data_dir=True, Tagger=True, Parser=True, Entity=True, Matcher=True, Packer=None, load_vectors=True</span></span></a></summary>
<p>Load the resources. Loading takes 20 seconds, and the instance consumes 2 to 3 gigabytes of memory.</p>
<p>Load data from the default directory:</p>
<pre class="language-python"><code>>>> nlp = English()
>>> nlp = English(data_dir=u'')</code></pre>
<p>Load data from a specified directory:</p>
<pre class="language-python"><code>>>> nlp = English(data_dir=u'path/to/data_directory')</code></pre>
<p>Disable (and avoid loading) parts of the processing pipeline:</p>
<pre class="language-python"><code>>>> nlp = English(load_vectors=False, Parser=False, Tagger=False, Entity=False)</code></pre>
<p>Start with nothing loaded:</p>
<pre class="language-python"><code>>>> nlp = English(data_dir=None)</code></pre>
<ul>
<li><strong>data_dir</strong> – The data directory. May be <code>None</code>, to disable any data loading (including the vocabulary).
</li>
<li><strong>Tagger</strong> – A class/function that creates the part-of-speech tagger. Usually this is left <code>True</code>, to load the default tagger. If falsey, no tagger is loaded.
<p>You can also supply your own class/function, which will be called once on setup. The returned function will then be called in <code>English.__call__</code>. The function passed must accept two arguments, of types <code>(StringStore, directory)</code>, and produce a function that accepts one argument, of type <code>Doc</code>. Its return type is unimportant. (A sketch of such a factory follows at the end of this list.)</p>
</li>
<li><strong>Parser</strong> – A class/function that creates the syntactic dependency parser. Usually this is left <code>True</code>, to load the default parser. If falsey, no parser is loaded.
<p>You can also supply your own class/function, as described for <strong>Tagger</strong> above.</p>
</li>
<li><strong>Entity</strong> – A class/function that creates the named entity recognizer. Usually this is left <code>True</code>, to load the default entity recognizer. If falsey, no entity recognizer is loaded.
<p>You can also supply your own class/function, as described for <strong>Tagger</strong> above.</p>
</li>
<li><strong>load_vectors</strong> – A boolean value to control whether the word vectors are loaded.
</li>
</ul>
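<p>For illustration, a minimal sketch of a custom component factory. The names <code>my_tagger_factory</code> and <code>apply_tags</code> are hypothetical; only the <code>(StringStore, directory)</code> signature and the one-argument <code>Doc</code> function come from the description above:</p>
<pre class="language-python"><code>def my_tagger_factory(strings, data_dir):
    # Called once, at setup time; load any resources here.
    noun_id = strings[u'NOUN']  # e.g. intern a tag name as an integer ID
    def apply_tags(doc):
        # Called from English.__call__ with each Doc to annotate.
        for token in doc:
            pass  # inspect token.orth_ and store annotations here
    return apply_tags

nlp = English(Tagger=my_tagger_factory)</code></pre>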
</details>
<details open="true">
<summary><a name="English-__call__"><span class="declaration"><code>__call__</code><span class="parameters">text, tag=True, parse=True, entity=True</span></span></a></summary>
<p>The main entry point to spaCy. Takes raw unicode text, and returns a <code>Doc</code> object, which can be iterated to access <code>Token</code> and <code>Span</code> objects. spaCy's models are all linear-time, so you can supply documents of arbitrary length, e.g. whole novels.</p>
<ul>
<li><strong>text</strong> (<a class="reference" href="http://docs.python.org/library/functions.html#unicode"><em>unicode</em></a>) – The text to be processed. spaCy expects raw unicode text – you don't necessarily need to, say, split it into paragraphs. However, depending on your documents, you might be better off applying custom pre-processing. Non-text formatting, e.g. from HTML mark-up, should be removed before sending the document to spaCy. If your documents have a consistent format, you may be able to improve accuracy by pre-processing. For instance, if the first word of your documents is always in upper-case, it may be helpful to normalize it before supplying the text to spaCy.
</li>
<li><strong>tag</strong> (<a class="reference" href="http://docs.python.org/library/functions.html#bool"><em>bool</em></a>) – Whether to apply the part-of-speech tagger. Required for parsing and entity recognition.
</li>
<li><strong>parse</strong> (<a class="reference" href="http://docs.python.org/library/functions.html#bool"><em>bool</em></a>) – Whether to apply the syntactic dependency parser.
</li>
<li><strong>entity</strong> (<a class="reference" href="http://docs.python.org/library/functions.html#bool"><em>bool</em></a>) – Whether to apply the named entity recognizer.
</li>
</ul>
<pre class="language-python"><code>from spacy.en import English
nlp = English()
doc = nlp(u'Some text.') # Applies tagger, parser, entity
doc = nlp(u'Some text.', parse=False) # Applies tagger and entity, not parser
doc = nlp(u'Some text.', entity=False) # Applies tagger and parser, not entity
doc = nlp(u'Some text.', tag=False) # Does not apply tagger, entity or parser
doc = nlp(u'') # Zero-length tokens, not an error
# doc = nlp(b'Some text') <-- Error: need unicode
doc = nlp(b'Some text'.decode('utf8')) # Decode to unicode first.</code></pre>
</details>
</details>
<details>
<summary><a name="doc"><span class="declaration"><span class="label">class</span><code>Doc</code></span></a></summary>
<p>A sequence of <code>Token</code> objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings.</p>
<p>Internally, the <code>Doc</code> object holds an array of <code>TokenC</code> structs. The Python-level <code>Token</code> and <code>Span</code> objects are views of this array, i.e. they don't own the data themselves. These details of the internals shouldn't matter for the API – but they may help you read the code, and understand how spaCy is designed.</p>
<details>
<summary>
<h4>Constructors</h4>
</summary><a href="#English-__call__"><span class="declaration"><span class="label">via</span><code>English.__call__(unicode text)</code></span></a>
<details>
<summary><a><span class="declaration"><code>__init__</code><span class="parameters">self, vocab, orth_and_spaces=None</span></span></a></summary> This method of constructing a <code>Doc</code> object is usually only used for deserialization. Standard usage is to construct the document via a call to the language object.
<ul>
<li><strong>vocab</strong> – A Vocabulary object, which must match any models you want to use (e.g. tokenizer, parser, entity recognizer).
</li>
<li><strong>orth_and_spaces</strong> – A list of <code>(orth_id, has_space)</code> tuples, where <code>orth_id</code> is an integer, and <code>has_space</code> is a boolean, indicating whether the token has a trailing space.
</li>
</ul>
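<p>A minimal sketch of direct construction, mainly relevant for deserialization. The import path for <code>Doc</code> is an assumption and may vary by version:</p>
<pre class="language-python"><code>from spacy.en import English
from spacy.tokens.doc import Doc  # assumed import path

nlp = English()
hello = nlp.vocab.strings[u'Hello']   # orth IDs come from the StringStore
world = nlp.vocab.strings[u'world']
doc = Doc(nlp.vocab, orth_and_spaces=[(hello, True), (world, False)])</code></pre>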
</details>
</details>
<details>
<summary>
<h4>Sequence API</h4>
</summary>
<li><span class="declaration"><code>doc[i]</code></span> Get the <code>Token</code> object at position <code>i</code>, where <code>i</code> is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. <code>doc[-2]</code> is <code>doc[len(doc) - 2]</code>.
</li>
<li><span class="declaration"><code>doc[start : end]</code></span> Get a <code>Span</code> object, starting at position <code>start</code> and ending at position <code>end</code>. For instance, <code>doc[2:5]</code> produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. <code>doc[start : end : step]</code>) are not supported, as <code>Span</code> objects must be contiguous (cannot have gaps).
</li>
<li><span class="declaration"><code>for token in doc</code></span> Iterate over <code>Token</code> objects, from which the annotations can be easily accessed. This is the main way of accessing <code>Token</code> objects, which are the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython, via <code>Doc.data</code>, an array of <code>TokenC</code> structs. The C API has not yet been finalized, and is subject to change.
</li>
<li><span class="declaration"><code>len(doc)</code></span> The number of tokens in the document.
</li>
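<p>For example:</p>
<pre class="language-python"><code>doc = nlp(u'The quick brown fox jumped.')
token = doc[0]             # first Token: 'The'
span = doc[1:4]            # Span over 'quick brown fox'
assert doc[-1].orth_ == doc[len(doc) - 1].orth_
words = [token.orth_ for token in doc]</code></pre>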
</details>
<details>
<summary>
<h4>Sentence, entity and noun chunk spans</h4>
</summary>
<details>
<summary><span class="declaration"><code>sents</code></span></summary>
<p> Yields sentence <code>Span</code> objects. Iterate over the span to get individual <code>Token</code> objects. Sentence spans have no label.
<pre class="language-python"><code>>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(u"This is a sentence. Here's another...")
>>> for sentence in doc.sents:
...     sentence.root.orth_
is
's</code></pre>
</p>
</details>
<details>
<summary><span class="declaration"><code>ents</code></span></summary>
<p> Yields named-entity <code>Span</code> objects. Iterate over the span to get individual <code>Token</code> objects, or access the label:
<pre class="language-python"><code>>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
>>> ents = list(tokens.ents)
>>> ents[0].label, ents[0].label_, ents[0].orth_, ents[0].string
(112504, 'PERSON', 'Best', 'Best ')</code></pre>
</p>
</details>
<details>
<summary><span class="declaration"><code>noun_chunks</code></span></summary>
<p> Yields base noun-phrase <code>Span</code> objects. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses. For example:
<pre class="language-python"><code>>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(u'The sentence in this example has three noun chunks.')
>>> for chunk in doc.noun_chunks:
...     print(chunk.label_, chunk.orth_, '<--', chunk.root.head.orth_)
NP The sentence <-- has
NP this example <-- in
NP three noun chunks <-- has</code></pre>
</p>
</details>
</details>
<details>
<summary>
<h4>Export/Import</h4>
</summary>
<details>
<summary><a><span class="declaration"><code>to_array</code><span class="parameters">attr_ids</span></span></a></summary>Given a list of M attribute IDs, export the tokens to a numpy ndarray of shape N*M, where N is the length of the document.
<ul>
<li><strong>attr_ids</strong> (list[int]) – A list of attribute ID ints. Attribute IDs can be imported from <code>spacy.attrs</code>.
</li>
</ul>
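<p>For example, importing the attribute IDs as documented above:</p>
<pre class="language-python"><code>from spacy.attrs import ORTH, LEMMA

doc = nlp(u'apple apple orange')
arr = doc.to_array([ORTH, LEMMA])  # numpy array of shape (3, 2)</code></pre>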
</details>
<details>
<summary><a><span class="declaration"><code>count_by</code><span class="parameters">attr_id</span></span></a></summary>Produce a dict of <code>{attribute value (int): count (int)}</code> frequencies, keyed by the values of the given attribute ID.
<pre class="language-python"><code>>>> from spacy.en import English, attrs
>>> nlp = English()
>>> tokens = nlp(u'apple apple orange banana')
>>> tokens.count_by(attrs.ORTH)
{12800L: 1, 11880L: 2, 7561L: 1}
>>> tokens.to_array([attrs.ORTH])
array([[11880],
       [11880],
       [ 7561],
       [12800]])</code></pre>
</details>
<details>
<summary><a><span class="declaration"><code>from_array</code><span class="parameters">attrs, array</span></span></a></summary>Write to a <code>Doc</code> object, from an N*M array of attributes (the inverse of <code>to_array</code>).
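<p>A sketch of the intended round-trip with <code>to_array</code>; the use of the <code>HEAD</code> attribute here is an assumption:</p>
<pre class="language-python"><code>from spacy.attrs import HEAD

doc = nlp(u'Some text to export.')
arr = doc.to_array([HEAD])
new_doc = nlp(u'Some text to export.', parse=False)
new_doc.from_array([HEAD], arr)  # write the parse heads back onto the tokens</code></pre>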
</details>
<details>
<summary><a><span class="declaration"><code>from_bytes</code><span class="parameters"></span></span></a></summary>Deserialize, loading from bytes.
</details>
<details>
<summary><a><span class="declaration"><code>to_bytes</code><span class="parameters"></span></span></a></summary>Serialize, producing a byte string.
</details>
<details>
<summary><a><span class="declaration"><code>read_bytes</code><span class="parameters"></span></span></a></summary>A classmethod. Reads serialized <code>Doc</code> objects from a file-like object, yielding one byte string per document.
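<p>A typical round-trip, as a sketch (assuming <code>Doc</code> is imported, e.g. as in the constructor example above):</p>
<pre class="language-python"><code>with open('docs.bin', 'wb') as file_:
    file_.write(doc.to_bytes())

with open('docs.bin', 'rb') as file_:
    for byte_string in Doc.read_bytes(file_):
        doc = Doc(nlp.vocab)
        doc.from_bytes(byte_string)</code></pre>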
</details>
</details>
</details>
<details>
<summary><a name="token"><span class="declaration"><span class="label">class</span><code>Token</code></span></a></summary>A Token represents a single word, punctuation symbol or significant whitespace symbol. Integer IDs are provided for all string features. The (unicode) string is provided by an attribute of the same name followed by an underscore, e.g. <code>token.orth</code> is an integer ID, <code>token.orth_</code> is the unicode value. The only exception is the <code>Token.string</code> attribute, which is (unicode) string-typed.
<details>
<summary>
<h4>String Features</h4>
</summary>
<ul>
<li><span class="declaration"><code>lemma / lemma_</code></span>The "base" of the word, with no inflectional suffixes, e.g. the lemma of "developing" is "develop", the lemma of "geese" is "goose", etc. Note that <em>derivational</em> suffixes are not stripped, e.g. the lemma of "institutions" is "institution", not "institute". Lemmatization is performed using the WordNet data, but extended to also cover closed-class words such as pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his". We assign pronouns the lemma <code>-PRON-</code>.
</li>
</ul>
<ul>
<li><span class="declaration"><code>orth / orth_</code></span>The form of the word with no string normalization or processing, as it appears in the string, without trailing whitespace.
</li>
<li><span class="declaration"><code>lower / lower_</code></span>The form of the word, but forced to lower-case, i.e. <code class="language-python">lower = word.orth_.lower()</code>
</li>
<li><span class="declaration"><code>shape / shape_</code></span>A transform of the word's string, to show orthographic features. The characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d. After these mappings, sequences of 4 or more of the same character are truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx, :) --> :)
</li>
<li><span class="declaration"><code>prefix / prefix_</code></span>A length-N substring from the start of the word. Length may vary by language; currently for English n=1, i.e. <code class="language-python">prefix = word.orth_[:1]</code>
</li>
<li><span class="declaration"><code>suffix / suffix_</code></span>A length-N substring from the end of the word. Length may vary by language; currently for English n=3, i.e. <code class="language-python">suffix = word.orth_[-3:]</code>
</li>
</ul>
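<p>For example (the values shown assume the default English models):</p>
<pre class="language-python"><code>doc = nlp(u'Apples are delicious.')
apples = doc[0]
apples.orth_    # u'Apples'
apples.lower_   # u'apples'
apples.lemma_   # u'apple'
apples.shape_   # u'Xxxxx' (the run of five x's is truncated to four)
apples.prefix_  # u'A'
apples.suffix_  # u'les'</code></pre>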
</details>
<details>
<summary>
<h4>Boolean Flags</h4>
</summary>
<ul>
<li><span class="declaration"><code>is_alpha</code></span> Equivalent to <code class="language-python">word.orth_.isalpha()</code>
</li>
<li><span class="declaration"><code>is_ascii</code></span> Equivalent to <code class="language-python">not any(ord(c) >= 128 for c in word.orth_)</code>
</li>
<li><span class="declaration"><code>is_digit</code></span> Equivalent to <code class="language-python">word.orth_.isdigit()</code>
</li>
<li><span class="declaration"><code>is_lower</code></span> Equivalent to <code class="language-python">word.orth_.islower()</code>
</li>
<li><span class="declaration"><code>is_title</code></span> Equivalent to <code class="language-python">word.orth_.istitle()</code>
</li>
<li><span class="declaration"><code>is_punct</code></span> True if the word consists of punctuation characters. (There is no built-in <code>str.ispunct()</code>; spaCy computes this flag itself.)
</li>
<li><span class="declaration"><code>is_space</code></span> Equivalent to <code class="language-python">word.orth_.isspace()</code>
</li>
<li><span class="declaration"><code>like_url</code></span> Does the word resemble a URL?
</li>
<li><span class="declaration"><code>like_num</code></span> Does the word represent a number? e.g. “10.9”, “10”, “ten”, etc.
</li>
<li><span class="declaration"><code>like_email</code></span> Does the word resemble an email address?
</li>
<li><span class="declaration"><code>is_oov</code></span> Is the word out-of-vocabulary?
</li>
</ul>
<details>
<summary><a><span class="declaration"><code>check_flag</code><span class="parameters">flag_id</span></span></a></summary>Get the value of one of the boolean flags.
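<p>A short sketch; the flag ID <code>IS_ALPHA</code> is assumed to be importable from <code>spacy.attrs</code>:</p>
<pre class="language-python"><code>from spacy.attrs import IS_ALPHA  # assumed flag ID

doc = nlp(u'Hello world')
assert doc[0].check_flag(IS_ALPHA) == doc[0].is_alpha</code></pre>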
</details>
</details>
<details>
<summary>
<h4>Distributional Features</h4>
</summary>
<ul>
<li><span class="declaration"><code>prob</code></span> The unigram log-probability of the word, estimated from counts from a large corpus, smoothed using Simple Good-Turing estimation.
</li>
<li><span class="declaration"><code>cluster</code></span> The Brown cluster ID of the word. These are often useful features for linear models. If you’re using a non-linear model, particularly a neural net or random forest, consider using the real-valued word representation vector, in <code>Token.repvec</code>, instead.
</li>
<li><span class="declaration"><code>repvec</code></span> A “word embedding” representation: a dense real-valued vector that supports similarity queries between words. By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model.
</li>
</ul>
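<p>For example (the exact values depend on the loaded models):</p>
<pre class="language-python"><code>doc = nlp(u'Apples are delicious.')
apples = doc[0]
apples.prob      # a negative float, e.g. -10.5 (unigram log probability)
apples.cluster   # an integer Brown cluster ID
apples.repvec    # a dense numpy vector</code></pre>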
</details>
<details>
<summary>
<h4>Alignment and Output</h4>
</summary>
<ul>
<li><span class="declaration"><code>idx</code></span>Start index of the token in the string.
</li>
<li><span class="declaration"><code>len(token)</code></span>Length of the token's orth string, in unicode code-points.
</li>
<li><span class="declaration"><code>unicode(token)</code></span>Same as <code>token.orth_</code>.
</li>
<li><span class="declaration"><code>str(token)</code></span>In Python 3, returns <code>token.orth_</code>. In Python 2, returns <code>token.orth_.encode('utf8')</code>.
</li>
<li><span class="declaration"><code>string</code></span><code>token.orth_ + token.whitespace_</code>, i.e. the form of the word as it appears in the string, including trailing whitespace. This is useful when you need to use linguistic features to add inline mark-up to the string.
</li>
<li><span class="declaration"><code>whitespace_</code></span>The whitespace that follows the token in the original string, if any (a single space or the empty string).
</li>
</ul>
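<p>For example:</p>
<pre class="language-python"><code>doc = nlp(u'Hello world!')
hello, world = doc[0], doc[1]
hello.idx          # 0
world.idx          # 6: character offset of 'world'
hello.string       # u'Hello ' (includes the trailing space)
world.string       # u'world' ('!' follows with no space)
world.whitespace_  # u''</code></pre>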
<details>
<summary>
<h4>Navigating the Parse Tree</h4>
</summary>
<li><span class="declaration"><code>head</code></span>The immediate syntactic head of the token. If the token is the root of its sentence, it is the token itself, i.e. <code>root_token.head is root_token</code>.
</li>
<li><span class="declaration"><code>children</code></span>An iterator that yields from <code>lefts</code>, and then yields from <code>rights</code>.
</li>
<li><span class="declaration"><code>subtree</code></span>An iterator for the part of the sentence syntactically governed by the word, including the word itself.
</li>
<li><span class="declaration"><code>left_edge</code></span>The leftmost edge of the token's subtree.
</li>
<li><span class="declaration"><code>right_edge</code></span>The rightmost edge of the token's subtree.
</li>
</details>
<details>
<summary><a><span class="declaration"><code>nbor(i=1)</code><span class="parameters"></span></span></a></summary>Get the <em>i</em>th next / previous neighboring token.
</details>
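<p>For example (the parse shown assumes the default English parser):</p>
<pre class="language-python"><code>doc = nlp(u'I like New York.')
like = doc[1]
assert like.head is like             # 'like' is the sentence root
[w.orth_ for w in like.children]     # e.g. [u'I', u'York', u'.']
[w.orth_ for w in like.subtree]      # the whole sentence
doc[2].nbor().orth_                  # u'York'</code></pre>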
</details>
<details>
<summary>
<h4>Named Entities</h4>
</summary>
<ul>
<li><span class="declaration"><code>ent_type</code></span>If the token is part of an entity, its entity type.
</li>
<li><span class="declaration"><code>ent_iob</code></span>The IOB (inside, outside, begin) entity recognition tag for the token.
</li>
</ul>
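<p>For example, using the sentence from the <code>ents</code> section above (the integer coding of <code>ent_iob</code> is an assumption):</p>
<pre class="language-python"><code>doc = nlp(u'Mr. Best flew to New York on Saturday morning.')
iobs = [(token.orth_, token.ent_iob, token.ent_type) for token in doc]
# 'Best' is inside a PERSON entity; 'New' and 'York' are inside
# a location entity (e.g. GPE), so their ent_iob and ent_type differ
# from the surrounding tokens.</code></pre>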
</details>
<details>
<summary>
<h4>Constructors</h4>
</summary>
<details>
<summary><a><span class="declaration"><code>__init__</code><span class="parameters">vocab, doc, offset</span></span></a></summary>
<ul>
<li><strong>vocab</strong> – A <code>Vocab</code> object.
</li>
<li><strong>doc</strong> – The parent sequence.
</li>
<li><strong>offset</strong> (<a class="reference" href="http://docs.python.org/library/functions.html#int"><em>int</em></a>) – The index of the token within the document.
</li>
</ul>
</details>
<!--+attribute("conjuncts")-->
<!-- | Conjuncts-->
</details>
</details>
<details>
<summary><a name="span"><span class="declaration"><span class="label">class</span><code>Span</code></span></a></summary>A <code>Span</code> is a slice of a <code>Doc</code> object, consisting of zero or more tokens. Spans are used to represent sentences, named entities, phrases, and arbitrary contiguous slices from the <code>Doc</code> object. <code>Span</code> objects are views – that is, they do not copy the underlying C data. This makes them cheap to construct, as internally they are simply a reference to the <code>Doc</code> object, a start position, an end position, and a label ID.
<li><span class="declaration"><code>token = span[i]</code></span>Get the <code>Token</code> object at position <em>i</em>, where <em>i</em> is an offset within the <code>Span</code>, not the document. That is:
<pre class="language-python"><code>span = doc[4:6]
token = span[0]
assert token.i == 4</code></pre>
</li>
<ul>
<li><span class="declaration"><code>for token in span</code></span>Iterate over the <code>Token</code> objects in the span.
</li>
<li><span class="declaration"><code>__len__</code></span>Number of tokens in the span.
</li>
<li><span class="declaration"><code>start</code></span>The start offset of the span, i.e. <code class="language-python">span[0].i</code>.
</li>
<li><span class="declaration"><code>end</code></span>The end offset of the span, i.e. <code class="language-python">span[-1].i + 1</code>.
</li>
</ul>
<details>
<summary>
<h4>Navigating the Parse Tree</h4>
</summary>
<ul>
<li><span class="declaration"><code>root</code></span>The first ancestor of the first word of the span that has its head outside the span. For example:
<pre class="language-python"><code>>>> toks = nlp(u'I like New York in Autumn.')</code></pre>
<p>Let's name the indices; it's easier than writing <code>toks[4]</code> etc.</p>
<pre class="language-python"><code>>>> i, like, new, york, in_, autumn, dot = range(len(toks))</code></pre>
<p>The head of <em>new</em> is <em>York</em>, and the head of <em>York</em> is <em>like</em>:</p>
<pre class="language-python"><code>>>> toks[new].head.orth_
'York'
>>> toks[york].head.orth_
'like'</code></pre>
<p>Create a span for "New York". Its root is "York".</p>
<pre class="language-python"><code>>>> new_york = toks[new:york+1]
>>> new_york.root.orth_
'York'</code></pre>
<p>When there are multiple words with external dependencies, we take the first:</p>
<pre class="language-python"><code>>>> toks[autumn].head.orth_, toks[dot].head.orth_
('in', 'like')
>>> autumn_dot = toks[autumn:]
>>> autumn_dot.root.orth_
'Autumn'</code></pre>
</li>
<li><span class="declaration"><code>lefts</code></span>Tokens that are to the left of the span, whose head is within the span, i.e.
<pre class="language-python"><code>lefts = [span.doc[i] for i in range(0, span.start)
         if span.doc[i].head in span]</code></pre>
</li>
<li><span class="declaration"><code>rights</code></span>Tokens that are to the right of the span, whose head is within the span, i.e.
<pre class="language-python"><code>rights = [span.doc[i] for i in range(span.end, len(span.doc))
          if span.doc[i].head in span]</code></pre>
</li>
</ul>
<li><span class="declaration"><code>subtree</code></span>Tokens in the range <code>(start, end+1)</code>, where <code>start</code> is the index of the leftmost word descended from a token in the span, and <code>end</code> is the index of the rightmost token descended from a token in the span.
</li>
</details>
<details>
<summary>
<h4>Constructors</h4>
</summary>
<ul>
<li><span class="declaration"><code>doc[start : end]</code></span>
</li>
<li><span class="declaration"><code>for entity in doc.ents</code></span>
</li>
<li><span class="declaration"><code>for sentence in doc.sents</code></span>
</li>
<li><span class="declaration"><code>for noun_phrase in doc.noun_chunks</code></span>
</li>
<li><span class="declaration"><code>span = Span(doc, start, end, label=0)</code></span>
</li>
</ul>
<details>
<summary><a><span class="declaration"><code>__init__</code><span class="parameters"></span></span></a></summary>Spans are usually constructed by slicing the <code>Doc</code>, e.g. <code>span = doc[0:4]</code>, rather than by calling <code>__init__</code> directly.
</details>
</details>
<details>
<summary>
<h4>String Views</h4>
</summary>
<details open="open">
<summary><span class="declaration"><code>string</code></span></summary>
<p>The text content of the span, including trailing whitespace, i.e. the concatenation of <code>token.string</code> for each token in the span.
</p>
</details>
<details open="open">
<summary><span class="declaration"><code>lemma / lemma_</code></span></summary>
<p>The lemma form of the span's text, as an integer ID (<code>lemma</code>) or unicode string (<code>lemma_</code>), analogous to <code>Token.lemma / lemma_</code>.
</p>
</details>
<details open="open">
<summary><span class="declaration"><code>label / label_</code></span></summary>
<p>The span's label, as an integer ID (<code>label</code>) or unicode string (<code>label_</code>), e.g. the entity type for spans from <code>doc.ents</code>, or <code>'NP'</code> for noun chunks. Sentence spans have no label.
</p>
</details>
</details>
</details>
<details>
<summary><a name="lexeme"><span class="declaration"><span class="label">class</span><code>Lexeme</code></span></a></summary>
<p>The Lexeme object represents a lexical type, stored in the vocabulary – as opposed to a token, occurring in a document.</p>
<p>Lexemes store various features, so that these features can be computed once per type, rather than once per token. As job sizes grow, this can amount to a substantial efficiency improvement.</p>
<p>All Lexeme attributes are therefore context independent, as a single lexeme is reused for all usages of that word. Lexemes are keyed by the “orth” attribute.</p>
<p>All Lexeme attributes are accessible directly on the Token object.</p>
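<p>For example:</p>
<pre class="language-python"><code>apple_lexeme = nlp.vocab[u'apple']     # Lexeme, from the vocabulary
apple_token = nlp(u'apple')[0]         # Token, from a document
assert apple_lexeme.orth == apple_token.orth
assert apple_lexeme.is_alpha           # context-independent feature</code></pre>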
<details>
<summary>
<h4>String Features</h4>
</summary>
<ul>
<li><span class="declaration"><code>orth / orth_</code></span>The form of the word with no string normalization or processing, as it appears in the string, without trailing whitespace.
</li>
<li><span class="declaration"><code>lower / lower_</code></span>The form of the word, but forced to lower-case, i.e. <code class="language-python">lower = word.orth_.lower()</code>
</li>
<li><span class="declaration"><code>shape / shape_</code></span>A transform of the word's string, to show orthographic features. The characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d. After these mappings, sequences of 4 or more of the same character are truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx, :) --> :)
</li>
<li><span class="declaration"><code>prefix / prefix_</code></span>A length-N substring from the start of the word. Length may vary by language; currently for English n=1, i.e. <code class="language-python">prefix = word.orth_[:1]</code>
</li>
<li><span class="declaration"><code>suffix / suffix_</code></span>A length-N substring from the end of the word. Length may vary by language; currently for English n=3, i.e. <code class="language-python">suffix = word.orth_[-3:]</code>
</li>
</ul>
</details>
<details>
<summary>
<h4>Boolean Features</h4>
</summary>
<ul>
<li><span class="declaration"><code>is_alpha</code></span> Equivalent to <code class="language-python">word.orth_.isalpha()</code>
</li>
<li><span class="declaration"><code>is_ascii</code></span> Equivalent to <code class="language-python">not any(ord(c) >= 128 for c in word.orth_)</code>
</li>
<li><span class="declaration"><code>is_digit</code></span> Equivalent to <code class="language-python">word.orth_.isdigit()</code>
</li>
<li><span class="declaration"><code>is_lower</code></span> Equivalent to <code class="language-python">word.orth_.islower()</code>
</li>
<li><span class="declaration"><code>is_title</code></span> Equivalent to <code class="language-python">word.orth_.istitle()</code>
</li>
<li><span class="declaration"><code>is_punct</code></span> True if the word consists of punctuation characters. (There is no built-in <code>str.ispunct()</code>; spaCy computes this flag itself.)
</li>
<li><span class="declaration"><code>is_space</code></span> Equivalent to <code class="language-python">word.orth_.isspace()</code>
</li>
<li><span class="declaration"><code>like_url</code></span> Does the word resemble a URL?
</li>
<li><span class="declaration"><code>like_num</code></span> Does the word represent a number? e.g. “10.9”, “10”, “ten”, etc.
</li>
<li><span class="declaration"><code>like_email</code></span> Does the word resemble an email address?
</li>
<li><span class="declaration"><code>is_oov</code></span> Is the word out-of-vocabulary?
</li>
</ul>
</details>
<details>
<summary>
<h4>Distributional Features</h4>
</summary>
<ul>
<li><span class="declaration"><code>prob</code></span> The unigram log-probability of the word, estimated from counts from a large corpus, smoothed using Simple Good-Turing estimation.
</li>
<li><span class="declaration"><code>cluster</code></span> The Brown cluster ID of the word. These are often useful features for linear models. If you’re using a non-linear model, particularly a neural net or random forest, consider using the real-valued word representation vector, in <code>Token.repvec</code>, instead.
</li>
<li><span class="declaration"><code>repvec</code></span> A “word embedding” representation: a dense real-valued vector that supports similarity queries between words. By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model.
</li>
</ul>
</details>
<details>
<summary>
<h4>Constructors</h4>
</summary>
<details open="open">
<summary><a><span class="declaration"><code>__init__</code><span class="parameters"></span></span></a></summary>
<p>Lexemes are normally created by the <code>Vocab</code>; rather than calling <code>__init__</code> directly, look them up via <code>vocab[string]</code> or <code>vocab[orth_id]</code>.</p>
</details>
</details>
</details>
<details>
<summary><a><span class="declaration"><span class="label">class</span><code>Vocab</code></span></a></summary>
<ul>
<li><span class="declaration"><code>lexeme = vocab[integer_id]</code></span>Get a lexeme by its orth ID.
</li>
<li><span class="declaration"><code>lexeme = vocab[string]</code></span>Get a lexeme by the string corresponding to its orth ID.
</li>
<li><span class="declaration"><code>for lexeme in vocab</code></span>Iterate over <code>Lexeme</code> objects.
</li>
<li><span class="declaration"><code>vocab[integer_id] = attributes_dict</code></span>Set attributes of the lexeme with the given orth ID, from a dictionary of properties.
</li>
<li><span class="declaration"><code>len(vocab)</code></span>Number of lexemes (unique words) in the vocabulary.
</li>
</ul>
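<p>For example:</p>
<pre class="language-python"><code>apple = nlp.vocab[u'apple']        # look up by string
same = nlp.vocab[apple.orth]       # look up by orth ID
assert apple.orth == same.orth
n_types = len(nlp.vocab)           # number of unique lexemes</code></pre>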
<details>
<summary>
<h4>Constructors</h4>
</summary>
<details open="open">
<summary><a><span class="declaration"><code>__init__</code><span class="parameters"></span></span></a></summary>The <code>Vocab</code> instance is usually loaded for you when you create a language object such as <code>English</code>, and is then available as <code>nlp.vocab</code>.
</details>
</details>
<details>
<summary>
<h4>Save and Load</h4>
</summary>
<details open="open">
<summary><a><span class="declaration"><code>dump</code><span class="parameters">loc</span></span></a></summary>
<ul>
<li><strong>loc</strong> (<a class="reference" href="http://docs.python.org/library/functions.html#unicode"><em>unicode</em></a>) – Path where the vocabulary should be saved.
</li>
</ul>
</details>
<details open="open">
<summary><a><span class="declaration"><code>load_lexemes</code><span class="parameters">loc</span></span></a></summary>
<ul>
<li><strong>loc</strong> (<a class="reference" href="http://docs.python.org/library/functions.html#unicode"><em>unicode</em></a>) – Path to load the <code>lexemes.bin</code> file from.
</li>
</ul>
</details>
<details open="open">
<summary><a><span class="declaration"><code>load_vectors</code><span class="parameters">loc</span></span></a></summary>
<ul>
<li><strong>loc</strong> (<a class="reference" href="http://docs.python.org/library/functions.html#unicode"><em>unicode</em></a>) – Path to load the <code>vectors.bin</code> file from.
</li>
</ul>
</details>
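<p>For example (the paths are illustrative):</p>
<pre class="language-python"><code>nlp.vocab.dump(u'/tmp/lexemes.bin')
nlp.vocab.load_lexemes(u'/tmp/lexemes.bin')
nlp.vocab.load_vectors(u'/tmp/vectors.bin')</code></pre>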
</details>
</details>
<details>
<summary><a><span class="declaration"><span class="label">class</span><code>StringStore</code></span></a></summary>
<p>Intern strings, and map them to sequential integer IDs. The mapping table is very efficient, and a small-string optimization is used to maintain a small memory footprint. Only the integer IDs are held by spaCy's data classes (<code>Doc</code>, <code>Token</code>, <code>Span</code> and <code>Lexeme</code>) – when you use a string-valued attribute like <code>token.orth_</code>, you access a property that computes <code>token.vocab.strings[token.orth]</code>.</p>
<ul>
<li><span class="declaration"><code>string = string_store[int_id]</code></span>Retrieve a string from a given integer ID. If the integer ID is not found, raise <code>IndexError</code>.
</li>
<li><span class="declaration"><code>int_id = string_store[unicode_string]</code></span> Map a unicode string to an integer ID. If the string is previously unseen, it is interned, and a new ID is returned.
</li>
<li><span class="declaration"><code>int_id = string_store[utf8_byte_string]</code></span> Byte strings are assumed to be in UTF-8 encoding. Strings encoded with other codecs may fail silently. Given a utf8 string, the behaviour is the same as for unicode strings. Internally, strings are stored in UTF-8 format. So if you start with a UTF-8 byte string, it's less efficient to first decode it as unicode, as StringStore will then have to encode it as UTF-8 once again.
</li>
<li><span class="declaration"><code>n_strings = len(string_store)</code></span>Number of strings in the string-store.
</li>
<li><span class="declaration"><code>for string in string_store</code></span>Iterate over strings in the string store, in order, such that the <em>i</em>th string in the sequence has the ID <em>i</em>:
<pre class="language-python"><code>for i, string in enumerate(string_store):
    assert i == string_store[string]</code></pre>
</li>
</ul>
<details>
<summary>
<h4>Constructors</h4>
</summary>
<p><code>StringStore.__init__</code> takes no arguments, so a new instance can be constructed as follows:</p>
<pre class="language-python"><code>string_store = StringStore()</code></pre>
<p>However, in practice you'll usually use the instance owned by the language's <code>vocab</code> object, which all classes hold a reference to:</p>
<ul>
<li><code class="language-python">english.vocab.strings</code></li>
<li><code class="language-python">doc.vocab.strings</code></li>
<li><code class="language-python">span.vocab.strings</code></li>
<li><code class="language-python">token.vocab.strings</code></li>
<li><code class="language-python">lexeme.vocab.strings</code></li>
</ul>
<p>If you create another instance, it will map strings to different integers – which is usually not what you want.</p>
</details>
<details>
<summary>
<h4>Save and Load</h4>
</summary>
<details open="open">
<summary><a><span class="declaration"><code>dump</code><span class="parameters">loc</span></span></a></summary>
<p>Save the strings mapping to the given location, in plain text. The format is subject to change; if you need to read/write compatible files, you can find details in the <code>strings.pyx</code> source.</p>
</details>
<details open="open">
<summary><a><span class="declaration"><code>load</code><span class="parameters">loc</span></span></a></summary>
<p>Load the strings mapping from a plain-text file in the given location. The format is subject to change; if you need to read/write compatible files, you can find details in the <code>strings.pyx</code> source.</p>
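<p>For example (the path is illustrative):</p>
<pre class="language-python"><code>nlp.vocab.strings.dump(u'/tmp/strings.txt')
nlp.vocab.strings.load(u'/tmp/strings.txt')</code></pre>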
</details>
</details>
</details>