Update API documentation
parent 3f4fd2c5d5
commit 808f7ee417
website/api/_annotation/_biluo.jade (new file, 43 lines)
@@ -0,0 +1,43 @@
//- 💫 DOCS > API > ANNOTATION > BILUO

+table([ "Tag", "Description" ])
    +row
        +cell #[code #[span.u-color-theme B] EGIN]
        +cell The first token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme I] N]
        +cell An inner token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme L] AST]
        +cell The final token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme U] NIT]
        +cell A single-token entity.

    +row
        +cell #[code #[span.u-color-theme O] UT]
        +cell A non-entity token.

+aside("Why BILUO, not IOB?")
    | There are several coding schemes for encoding entity annotations as
    | token tags. These coding schemes are equally expressive, but not
    | necessarily equally learnable.
    | #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
    | showed that the minimal #[strong Begin], #[strong In], #[strong Out]
    | scheme was more difficult to learn than the #[strong BILUO] scheme that
    | we use, which explicitly marks boundary tokens.

p
    | spaCy translates the character offsets into this scheme, in order to
    | decide the cost of each action given the current state of the entity
    | recogniser. The costs are then used to calculate the gradient of the
    | loss, to train the model. The exact algorithm is a pastiche of
    | well-known methods, and is not currently described in any single
    | publication. The model is a greedy transition-based parser guided by a
    | linear model whose weights are learned using the averaged perceptron
    | loss, via the #[+a("http://www.aclweb.org/anthology/C12-1059") dynamic oracle]
    | imitation learning strategy. The transition system is equivalent to the
    | BILUO tagging scheme.
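p
    | As a quick illustration (not part of the scheme definition above), the
    | sentence "Alex Smith visited Acme Corp" with the entities "Alex Smith"
    | (#[code PERSON]) and "Acme Corp" (#[code ORG]) is encoded as the tag
    | sequence #[code B-PERSON L-PERSON O B-ORG L-ORG]. The sketch below
    | assumes the #[code biluo_tags_from_offsets] helper in #[code spacy.gold]
    | to convert character offsets into these tags:

+code.
    from spacy.lang.en import English
    from spacy.gold import biluo_tags_from_offsets

    nlp = English()
    doc = nlp(u'Alex Smith visited Acme Corp')
    entities = [(0, 10, 'PERSON'), (19, 28, 'ORG')]
    tags = biluo_tags_from_offsets(doc, entities)
    # tags == ['B-PERSON', 'L-PERSON', 'O', 'B-ORG', 'L-ORG']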
website/api/_architecture/_cython.jade (new file, 115 lines)
@@ -0,0 +1,115 @@
//- 💫 DOCS > API > ARCHITECTURE > CYTHON

+aside("What's Cython?")
    | #[+a("http://cython.org/") Cython] is a language for writing
    | C extensions for Python. Most Python code is also valid Cython, but
    | you can add type declarations to get efficient memory-managed code
    | just like C or C++.

p
    | spaCy's core data structures are implemented as
    | #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
    | managed through the #[+a(gh("cymem")) #[code cymem]]
    | #[code cymem.Pool] class, which allows you
    | to allocate memory which will be freed when the #[code Pool] object
    | is garbage collected. This means you usually don't have to worry
    | about freeing memory. You just have to decide which Python object
    | owns the memory, and make it own the #[code Pool]. When that object
    | goes out of scope, the memory will be freed. You do have to take
    | care that no pointers outlive the object that owns them — but this
    | is generally quite easy.
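p
    | A minimal sketch of this ownership pattern (the class and field names
    | here are illustrative, not taken from the spaCy code base):

+code.
    from cymem.cymem cimport Pool

    cdef class Matrix:
        cdef readonly Pool mem
        cdef float* data

        def __init__(self, int nr_row, int nr_col):
            self.mem = Pool()
            # allocated via the Pool, so the memory is freed when the
            # Matrix (and therefore its Pool) is garbage collected
            self.data = <float*>self.mem.alloc(nr_row * nr_col, sizeof(float))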
p
    | All Cython modules should have the #[code # cython: infer_types=True]
    | compiler directive at the top of the file. This makes the code much
    | cleaner, as it avoids the need for many type declarations. If
    | possible, you should prefer to declare your functions #[code nogil],
    | even if you don't especially care about multi-threading. The reason
    | is that #[code nogil] functions help the Cython compiler reason about
    | your code quite a lot — you're telling the compiler that no Python
    | dynamics are possible. This lets many errors be raised, and ensures
    | your function will run at C speed.
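p
    | As a small illustration of both conventions (this module is made up
    | for the example, it is not part of spaCy):

+code.
    # cython: infer_types=True

    cdef int sum_to(int n) nogil:
        # no Python objects are touched here, so the function can be
        # declared nogil; the types of "total" and "i" are inferred
        total = 0
        for i in range(n):
            total += i
        return total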
p
    | Cython gives you many choices of sequences: you could have a Python
    | list, a numpy array, a memory view, a C++ vector, or a pointer.
    | Pointers are preferred, because they are fastest, have the most
    | explicit semantics, and let the compiler check your code more
    | strictly. C++ vectors are also great — but you should only use them
    | internally in functions. It's less friendly to accept a vector as an
    | argument, because that asks the user to do much more work. Here's
    | how to get a pointer from a numpy array, memory view or vector:

+code.
    cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
        pointer1 = <int*>numpy_array.data
        pointer2 = cpp_vector.data()
        pointer3 = &memory_view[0]

p
    | Both C arrays and C++ vectors reassure the compiler that no Python
    | operations are possible on your variable. This is a big advantage:
    | it lets the Cython compiler raise many more errors for you.

p
    | When getting a pointer from a numpy array or memoryview, take care
    | that the data is actually stored in C-contiguous order — otherwise
    | you'll get a pointer to nonsense. The type-declarations in the code
    | above should generate runtime errors if buffers with incorrect
    | memory layouts are passed in. To iterate over the array, the
    | following style is preferred:

+code.
    cdef int c_total(const int* int_array, int length) nogil:
        total = 0
        for item in int_array[:length]:
            total += item
        return total

p
    | If this is confusing, consider that the compiler couldn't deal with
    | #[code for item in int_array:] — there's no length attached to a raw
    | pointer, so how could we figure out where to stop? The length is
    | provided in the slice notation as a solution to this. Note that we
    | don't have to declare the type of #[code item] in the code above —
    | the compiler can easily infer it. This gives us tidy code that looks
    | quite like Python, but is exactly as fast as C — because we've made
    | sure the compilation to C is trivial.

p
    | Your functions cannot be declared #[code nogil] if they need to
    | create Python objects or call Python functions. This is perfectly
    | okay — you shouldn't torture your code just to get #[code nogil]
    | functions. However, if your function isn't #[code nogil], you should
    | compile your module with #[code cython -a --cplus my_module.pyx] and
    | open the resulting #[code my_module.html] file in a browser. This
    | will let you see how Cython is compiling your code. Calls into the
    | Python run-time will be in bright yellow. This lets you easily see
    | whether Cython is able to correctly type your code, or whether there
    | are unexpected problems.

p
    | Working in Cython is very rewarding once you're over the initial
    | learning curve. As with C and C++, the first way you write something
    | in Cython will often be the performance-optimal approach. In
    | contrast, Python optimisation generally requires a lot of
    | experimentation. Is it faster to have an #[code if item in my_dict]
    | check, or to use #[code .get()]? What about
    | #[code try]/#[code except]? Does this numpy operation create a copy?
    | There's no way to guess the answers to these questions, and you'll
    | usually be dissatisfied with your results — so there's no way to
    | know when to stop this process. In the worst case, you'll make a
    | mess that invites the next reader to try their luck too. This is
    | like one of those
    | #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
    | where the rescuers keep passing out from low oxygen, causing
    | another rescuer to follow — only to succumb themselves. In short,
    | just say no to optimizing your Python. If it's not fast enough the
    | first time, just switch to Cython.

+infobox("Resources")
    +list.o-no-block
        +item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
        +item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
        +item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)
website/api/_architecture/_nn-model.jade (new file, 141 lines)
@@ -0,0 +1,141 @@
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE

p
    | The parsing model is a blend of recent results. The two recent
    | inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg
    | at Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation
    | of the parser is still based on the work of Joakim Nivre#[+fn(2)], who
    | introduced the transition-based framework#[+fn(3)], the arc-eager
    | transition system, and the imitation learning objective. The model is
    | implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
    | library. We first predict context-sensitive vectors for each word in the
    | input:

+code.
    (embed_lower | embed_prefix | embed_suffix | embed_shape)
        >> Maxout(token_width)
        >> convolution ** 4

p
    | This convolutional layer is shared between the tagger, parser and NER,
    | and will also be shared by the future neural lemmatizer. Because the
    | parser shares these layers with the tagger, the parser does not require
    | tag features. I got this trick from David Weiss's "Stack-propagation"
    | paper#[+fn(4)].

p
    | To boost the representation, the tagger actually predicts a "super tag"
    | with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
    | these supertags by adding a softmax layer onto the convolutional layer –
    | so, we're teaching the convolutional layer to give us a representation
    | that's one affine transform from this informative lexical information.
    | This is obviously good for the parser (which backprops to the
    | convolutions too). The parser model makes a state vector by concatenating
    | the vector representations for its context tokens. The current context
    | tokens:

+table
    +row
        +cell #[code S0], #[code S1], #[code S2]
        +cell Top three words on the stack.

    +row
        +cell #[code B0], #[code B1]
        +cell First two words of the buffer.

    +row
        +cell.u-nowrap
            | #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
            | #[code B1L1]#[br]
            | #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
            | #[code B1L2]
        +cell
            | Leftmost and second leftmost children of #[code S0], #[code S1],
            | #[code S2], #[code B0] and #[code B1].

    +row
        +cell.u-nowrap
            | #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
            | #[code B1R1]#[br]
            | #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
            | #[code B1R2]
        +cell
            | Rightmost and second rightmost children of #[code S0], #[code S1],
            | #[code S2], #[code B0] and #[code B1].

p
    | This makes the state vector quite long: #[code 13*T], where #[code T] is
    | the token vector width (128 is working well). Fortunately, there's a way
    | to structure the computation to save some expense (and make it more
    | GPU-friendly).

p
    | The parser typically visits #[code 2*N] states for a sentence of length
    | #[code N] (although it may visit more, if it back-tracks with a
    | non-monotonic transition#[+fn(6)]). A naive implementation would require
    | #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
    | size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
    | multiplication, to pre-compute the hidden weights for each positional
    | feature with respect to the words in the batch. (Note that our token
    | vectors come from the CNN — so we can't play this trick over the
    | vocabulary. That's how Stanford's NN parser#[+fn(7)] works — and why its
    | model is so big.)
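p
    | To make the arithmetic concrete, here is a rough numpy sketch of the
    | pre-computation trick. The shapes and variable names are purely
    | illustrative; this is not spaCy's actual implementation:

+code.
    import numpy as np

    n_tokens, T, F, H = 20, 128, 13, 64    # batch tokens, token width, feature slots, hidden width
    tokens = np.random.rand(n_tokens, T)   # context-sensitive vectors from the CNN
    W = np.random.rand(F, T, H)            # one block of hidden weights per feature slot

    # One big multiplication up front: every token's contribution to every
    # feature slot, shape (n_tokens, F, H).
    precomputed = np.einsum('nt,fth->nfh', tokens, W)

    # For a single parser state, the hidden layer is then just 13 lookups
    # and a sum, instead of a (13*T) x H matrix multiplication per state.
    state = [4, 2, 0, 5, 6, 1, 3, 7, 8, 9, 10, 11, 12]  # token index per feature slot
    hidden = sum(precomputed[tok, slot] for slot, tok in enumerate(state))

    # Equivalent direct computation, for comparison
    direct = sum(tokens[tok] @ W[slot] for slot, tok in enumerate(state))
    assert np.allclose(hidden, direct)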
p
    | This pre-computation strategy allows a nice compromise between
    | GPU-friendliness and implementation simplicity. The CNN and the wide
    | lower layer are computed on the GPU, and then the precomputed hidden
    | weights are moved to the CPU, before we start the transition-based
    | parsing process. This makes a lot of things much easier. We don't have to
    | worry about variable-length batch sizes, and we don't have to implement
    | the dynamic oracle in CUDA to train.

p
    | Currently the parser's loss function is multilabel log loss#[+fn(8)], as
    | the dynamic oracle allows multiple states to be 0 cost. This is defined
    | as follows, where #[code Z] is the sum of the exponentiated scores over
    | all classes and #[code gZ] is the same sum restricted to the zero-cost
    | (gold) classes:

+code.
    (exp(score) / Z) - (exp(score) / gZ)
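p
    | A toy numpy sketch of the gradient this loss implies (the scores and
    | the set of zero-cost classes are invented for the example):

+code.
    import numpy as np

    scores = np.array([2.0, 1.0, 0.5, -1.0])        # one score per transition
    is_gold = np.array([True, True, False, False])  # zero-cost transitions

    exp_scores = np.exp(scores)
    Z = exp_scores.sum()              # partition over all classes
    gZ = exp_scores[is_gold].sum()    # partition over the zero-cost classes

    # Gradient of -log(gZ / Z) with respect to each score: the formula above
    # for the zero-cost classes, and just exp(score) / Z for the others.
    d_scores = exp_scores / Z - np.where(is_gold, exp_scores / gZ, 0.0)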
+bibliography
    +item
        | #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
        br
        | Eliyahu Kiperwasser, Yoav Goldberg (2016)

    +item
        | #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
        br
        | Yoav Goldberg, Joakim Nivre (2012)

    +item
        | #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
        br
        | Matthew Honnibal (2013)

    +item
        | #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
        br
        | Yuan Zhang, David Weiss (2016)

    +item
        | #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
        br
        | Anders Søgaard, Yoav Goldberg (2016)

    +item
        | #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
        br
        | Matthew Honnibal, Mark Johnson (2015)

    +item
        | #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
        br
        | Danqi Chen, Christopher D. Manning (2014)

    +item
        | #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
        br
        | Stefan Riezler et al. (2002)
|
@ -1,29 +1,32 @@
|
|||
{
|
||||
"sidebar": {
|
||||
"Introduction": {
|
||||
"Facts & Figures": "./",
|
||||
"Languages": "language-models",
|
||||
"Annotation Specs": "annotation"
|
||||
"Overview": {
|
||||
"Architecture": "./",
|
||||
"Annotation Specs": "annotation",
|
||||
"Functions": "top-level"
|
||||
},
|
||||
"Top-level": {
|
||||
"spacy": "spacy",
|
||||
"displacy": "displacy",
|
||||
"Utility Functions": "util",
|
||||
"Command line": "cli"
|
||||
},
|
||||
"Classes": {
|
||||
"Containers": {
|
||||
"Doc": "doc",
|
||||
"Token": "token",
|
||||
"Span": "span",
|
||||
"Lexeme": "lexeme"
|
||||
},
|
||||
|
||||
"Pipeline": {
|
||||
"Language": "language",
|
||||
"Tokenizer": "tokenizer",
|
||||
"Pipe": "pipe",
|
||||
"Tensorizer": "tensorizer",
|
||||
"Tagger": "tagger",
|
||||
"DependencyParser": "dependencyparser",
|
||||
"EntityRecognizer": "entityrecognizer",
|
||||
"TextCategorizer": "textcategorizer",
|
||||
"Tokenizer": "tokenizer",
|
||||
"Lemmatizer": "lemmatizer",
|
||||
"Matcher": "matcher",
|
||||
"Lexeme": "lexeme",
|
||||
"PhraseMatcher": "phrasematcher"
|
||||
},
|
||||
|
||||
"Other": {
|
||||
"Vocab": "vocab",
|
||||
"StringStore": "stringstore",
|
||||
"Vectors": "vectors",
|
||||
|
@ -34,52 +37,37 @@
|
|||
},
|
||||
|
||||
"index": {
|
||||
"title": "Facts & Figures",
|
||||
"next": "language-models"
|
||||
"title": "Architecture",
|
||||
"next": "annotation",
|
||||
"menu": {
|
||||
"Basics": "basics",
|
||||
"Neural Network Model": "nn-model",
|
||||
"Cython Conventions": "cython"
|
||||
}
|
||||
},
|
||||
|
||||
"language-models": {
|
||||
"title": "Languages",
|
||||
"next": "philosophy"
|
||||
},
|
||||
|
||||
"philosophy": {
|
||||
"title": "Philosophy"
|
||||
},
|
||||
|
||||
"spacy": {
|
||||
"title": "spaCy top-level functions",
|
||||
"source": "spacy/__init__.py",
|
||||
"next": "displacy"
|
||||
},
|
||||
|
||||
"displacy": {
|
||||
"title": "displaCy",
|
||||
"tag": "module",
|
||||
"source": "spacy/displacy",
|
||||
"next": "util"
|
||||
},
|
||||
|
||||
"util": {
|
||||
"title": "Utility Functions",
|
||||
"source": "spacy/util.py",
|
||||
"next": "cli"
|
||||
},
|
||||
|
||||
"cli": {
|
||||
"title": "Command Line Interface",
|
||||
"source": "spacy/cli"
|
||||
"top-level": {
|
||||
"title": "Top-level Functions",
|
||||
"menu": {
|
||||
"spacy": "spacy",
|
||||
"displacy": "displacy",
|
||||
"Utility Functions": "util",
|
||||
"Compatibility": "compat",
|
||||
"Command Line": "cli"
|
||||
}
|
||||
},
|
||||
|
||||
"language": {
|
||||
"title": "Language",
|
||||
"tag": "class",
|
||||
"teaser": "A text-processing pipeline.",
|
||||
"source": "spacy/language.py"
|
||||
},
|
||||
|
||||
"doc": {
|
||||
"title": "Doc",
|
||||
"tag": "class",
|
||||
"teaser": "A container for accessing linguistic annotations.",
|
||||
"source": "spacy/tokens/doc.pyx"
|
||||
},
|
||||
|
||||
|
@ -103,6 +91,7 @@
|
|||
|
||||
"vocab": {
|
||||
"title": "Vocab",
|
||||
"teaser": "A storage class for vocabulary and other data shared across a language.",
|
||||
"tag": "class",
|
||||
"source": "spacy/vocab.pyx"
|
||||
},
|
||||
|
@ -115,10 +104,27 @@
|
|||
|
||||
"matcher": {
|
||||
"title": "Matcher",
|
||||
"teaser": "Match sequences of tokens, based on pattern rules.",
|
||||
"tag": "class",
|
||||
"source": "spacy/matcher.pyx"
|
||||
},
|
||||
|
||||
"phrasematcher": {
|
||||
"title": "PhraseMatcher",
|
||||
"teaser": "Match sequences of tokens, based on documents.",
|
||||
"tag": "class",
|
||||
"tag_new": 2,
|
||||
"source": "spacy/matcher.pyx"
|
||||
},
|
||||
|
||||
"pipe": {
|
||||
"title": "Pipe",
|
||||
"teaser": "Abstract base class defining the API for pipeline components.",
|
||||
"tag": "class",
|
||||
"tag_new": 2,
|
||||
"source": "spacy/pipeline.pyx"
|
||||
},
|
||||
|
||||
"dependenyparser": {
|
||||
"title": "DependencyParser",
|
||||
"tag": "class",
|
||||
|
@ -127,18 +133,22 @@
|
|||
|
||||
"entityrecognizer": {
|
||||
"title": "EntityRecognizer",
|
||||
"teaser": "Annotate named entities on documents.",
|
||||
"tag": "class",
|
||||
"source": "spacy/pipeline.pyx"
|
||||
},
|
||||
|
||||
"textcategorizer": {
|
||||
"title": "TextCategorizer",
|
||||
"teaser": "Add text categorization models to spaCy pipelines.",
|
||||
"tag": "class",
|
||||
"tag_new": 2,
|
||||
"source": "spacy/pipeline.pyx"
|
||||
},
|
||||
|
||||
"dependencyparser": {
|
||||
"title": "DependencyParser",
|
||||
"teaser": "Annotate syntactic dependencies on documents.",
|
||||
"tag": "class",
|
||||
"source": "spacy/pipeline.pyx"
|
||||
},
|
||||
|
@ -149,15 +159,23 @@
|
|||
"source": "spacy/tokenizer.pyx"
|
||||
},
|
||||
|
||||
"lemmatizer": {
|
||||
"title": "Lemmatizer",
|
||||
"tag": "class"
|
||||
},
|
||||
|
||||
"tagger": {
|
||||
"title": "Tagger",
|
||||
"teaser": "Annotate part-of-speech tags on documents.",
|
||||
"tag": "class",
|
||||
"source": "spacy/pipeline.pyx"
|
||||
},
|
||||
|
||||
"tensorizer": {
|
||||
"title": "Tensorizer",
|
||||
"teaser": "Add a tensor with position-sensitive meaning representations to a document.",
|
||||
"tag": "class",
|
||||
"tag_new": 2,
|
||||
"source": "spacy/pipeline.pyx"
|
||||
},
|
||||
|
||||
|
@ -169,23 +187,38 @@
|
|||
|
||||
"goldcorpus": {
|
||||
"title": "GoldCorpus",
|
||||
"teaser": "An annotated corpus, using the JSON file format.",
|
||||
"tag": "class",
|
||||
"tag_new": 2,
|
||||
"source": "spacy/gold.pyx"
|
||||
},
|
||||
|
||||
"binder": {
|
||||
"title": "Binder",
|
||||
"tag": "class",
|
||||
"tag_new": 2,
|
||||
"source": "spacy/tokens/binder.pyx"
|
||||
},
|
||||
|
||||
"vectors": {
|
||||
"title": "Vectors",
|
||||
"teaser": "Store, save and load word vectors.",
|
||||
"tag": "class",
|
||||
"tag_new": 2,
|
||||
"source": "spacy/vectors.pyx"
|
||||
},
|
||||
|
||||
"annotation": {
|
||||
"title": "Annotation Specifications"
|
||||
"title": "Annotation Specifications",
|
||||
"teaser": "Schemes used for labels, tags and training data.",
|
||||
"menu": {
|
||||
"Tokenization": "tokenization",
|
||||
"Sentence Boundaries": "sbd",
|
||||
"POS Tagging": "pos-tagging",
|
||||
"Lemmatization": "lemmatization",
|
||||
"Dependencies": "dependency-parsing",
|
||||
"Named Entities": "named-entities",
|
||||
"Training Data": "training"
|
||||
}
|
||||
}
|
||||
}
|
|
@ -1,26 +1,17 @@
|
|||
//- 💫 DOCS > USAGE > COMMAND LINE INTERFACE
|
||||
|
||||
include ../../_includes/_mixins
|
||||
//- 💫 DOCS > API > TOP-LEVEL > COMMAND LINE INTERFACE
|
||||
|
||||
p
|
||||
| As of v1.7.0, spaCy comes with new command line helpers to download and
|
||||
| link models and show useful debugging information. For a list of available
|
||||
| commands, type #[code spacy --help].
|
||||
|
||||
+infobox("⚠️ Deprecation note")
|
||||
| As of spaCy 2.0, the #[code model] command to initialise a model data
|
||||
| directory is deprecated. The command was only necessary because previous
|
||||
| versions of spaCy expected a model directory to already be set up. This
|
||||
| has since been changed, so you can use the #[+api("cli#train") #[code train]]
|
||||
| command straight away.
|
||||
|
||||
+h(2, "download") Download
|
||||
+h(3, "download") Download
|
||||
|
||||
p
|
||||
| Download #[+a("/docs/usage/models") models] for spaCy. The downloader finds the
|
||||
| Download #[+a("/usage/models") models] for spaCy. The downloader finds the
|
||||
| best-matching compatible version, uses pip to download the model as a
|
||||
| package and automatically creates a
|
||||
| #[+a("/docs/usage/models#usage") shortcut link] to load the model by name.
|
||||
| #[+a("/usage/models#usage") shortcut link] to load the model by name.
|
||||
| Direct downloads don't perform any compatibility checks and require the
|
||||
| model name to be specified with its version (e.g., #[code en_core_web_sm-1.2.0]).
|
||||
|
||||
|
@ -49,15 +40,15 @@ p
|
|||
| detailed messages in case things go wrong. It's #[strong not recommended]
|
||||
| to use this command as part of an automated process. If you know which
|
||||
| model your project needs, you should consider a
|
||||
| #[+a("/docs/usage/models#download-pip") direct download via pip], or
|
||||
| #[+a("/usage/models#download-pip") direct download via pip], or
|
||||
| uploading the model to a local PyPi installation and fetching it straight
|
||||
| from there. This will also allow you to add it as a versioned package
|
||||
| dependency to your project.
|
||||
|
||||
+h(2, "link") Link
|
||||
+h(3, "link") Link
|
||||
|
||||
p
|
||||
| Create a #[+a("/docs/usage/models#usage") shortcut link] for a model,
|
||||
| Create a #[+a("/usage/models#usage") shortcut link] for a model,
|
||||
| either a Python package or a local directory. This will let you load
|
||||
| models from any location using a custom name via
|
||||
| #[+api("spacy#load") #[code spacy.load()]].
|
||||
|
@ -95,7 +86,7 @@ p
|
|||
+cell flag
|
||||
+cell Show help message and available arguments.
|
||||
|
||||
+h(2, "info") Info
|
||||
+h(3, "info") Info
|
||||
|
||||
p
|
||||
| Print information about your spaCy installation, models and local setup,
|
||||
|
@ -122,15 +113,15 @@ p
|
|||
+cell flag
|
||||
+cell Show help message and available arguments.
|
||||
|
||||
+h(2, "convert") Convert
|
||||
+h(3, "convert") Convert
|
||||
|
||||
p
|
||||
| Convert files into spaCy's #[+a("/docs/api/annotation#json-input") JSON format]
|
||||
| Convert files into spaCy's #[+a("/api/annotation#json-input") JSON format]
|
||||
| for use with the #[code train] command and other experiment management
|
||||
| functions. The right converter is chosen based on the file extension of
|
||||
| the input file. Currently only supports #[code .conllu].
|
||||
|
||||
+code(false, "bash", "$").
|
||||
+code(false, "bash", "$", false, false, true).
|
||||
spacy convert [input_file] [output_dir] [--n-sents] [--morphology]
|
||||
|
||||
+table(["Argument", "Type", "Description"])
|
||||
|
@ -159,14 +150,18 @@ p
|
|||
+cell flag
|
||||
+cell Show help message and available arguments.
|
||||
|
||||
+h(2, "train") Train
|
||||
+h(3, "train") Train
|
||||
|
||||
p
|
||||
| Train a model. Expects data in spaCy's
|
||||
| #[+a("/docs/api/annotation#json-input") JSON format].
|
||||
| #[+a("/api/annotation#json-input") JSON format]. On each epoch, a model
|
||||
| will be saved out to the directory. Accuracy scores and model details
|
||||
| will be added to a #[+a("/usage/training#models-generating") #[code meta.json]]
|
||||
| to allow packaging the model using the
|
||||
| #[+api("cli#package") #[code package]] command.
|
||||
|
||||
+code(false, "bash", "$").
|
||||
spacy train [lang] [output_dir] [train_data] [dev_data] [--n-iter] [--n-sents] [--use-gpu] [--no-tagger] [--no-parser] [--no-entities]
|
||||
+code(false, "bash", "$", false, false, true).
|
||||
spacy train [lang] [output_dir] [train_data] [dev_data] [--n-iter] [--n-sents] [--use-gpu] [--meta-path] [--vectors] [--no-tagger] [--no-parser] [--no-entities] [--gold-preproc]
|
||||
|
||||
+table(["Argument", "Type", "Description"])
|
||||
+row
|
||||
|
@ -204,6 +199,27 @@ p
|
|||
+cell option
|
||||
+cell Use GPU.
|
||||
|
||||
+row
|
||||
+cell #[code --vectors], #[code -v]
|
||||
+cell option
|
||||
+cell Model to load vectors from.
|
||||
|
||||
+row
|
||||
+cell #[code --meta-path], #[code -m]
|
||||
+cell option
|
||||
+cell
|
||||
| #[+tag-new(2)] Optional path to model
|
||||
| #[+a("/usage/training#models-generating") #[code meta.json]].
|
||||
| All relevant properties like #[code lang], #[code pipeline] and
|
||||
| #[code spacy_version] will be overwritten.
|
||||
|
||||
+row
|
||||
+cell #[code --version], #[code -V]
|
||||
+cell option
|
||||
+cell
|
||||
| Model version. Will be written out to the model's
|
||||
| #[code meta.json] after training.
|
||||
|
||||
+row
|
||||
+cell #[code --no-tagger], #[code -T]
|
||||
+cell flag
|
||||
|
@ -219,12 +235,18 @@ p
|
|||
+cell flag
|
||||
+cell Don't train NER.
|
||||
|
||||
+row
|
||||
+cell #[code --gold-preproc], #[code -G]
|
||||
+cell flag
|
||||
+cell Use gold preprocessing.
|
||||
|
||||
+row
|
||||
+cell #[code --help], #[code -h]
|
||||
+cell flag
|
||||
+cell Show help message and available arguments.
|
||||
|
||||
+h(3, "train-hyperparams") Environment variables for hyperparameters
|
||||
+h(4, "train-hyperparams") Environment variables for hyperparameters
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| spaCy lets you set hyperparameters for training via environment variables.
|
||||
|
@ -236,98 +258,96 @@ p
|
|||
+code(false, "bash").
|
||||
parser_hidden_depth=2 parser_maxout_pieces=1 train-parser
|
||||
|
||||
+under-construction
|
||||
|
||||
+table(["Name", "Description", "Default"])
|
||||
+row
|
||||
+cell #[code dropout_from]
|
||||
+cell
|
||||
+cell Initial dropout rate.
|
||||
+cell #[code 0.2]
|
||||
|
||||
+row
|
||||
+cell #[code dropout_to]
|
||||
+cell
|
||||
+cell Final dropout rate.
|
||||
+cell #[code 0.2]
|
||||
|
||||
+row
|
||||
+cell #[code dropout_decay]
|
||||
+cell
|
||||
+cell Rate of dropout change.
|
||||
+cell #[code 0.0]
|
||||
|
||||
+row
|
||||
+cell #[code batch_from]
|
||||
+cell
|
||||
+cell Initial batch size.
|
||||
+cell #[code 1]
|
||||
|
||||
+row
|
||||
+cell #[code batch_to]
|
||||
+cell
|
||||
+cell Final batch size.
|
||||
+cell #[code 64]
|
||||
|
||||
+row
|
||||
+cell #[code batch_compound]
|
||||
+cell
|
||||
+cell Rate of batch size acceleration.
|
||||
+cell #[code 1.001]
|
||||
|
||||
+row
|
||||
+cell #[code token_vector_width]
|
||||
+cell
|
||||
+cell Width of embedding tables and convolutional layers.
|
||||
+cell #[code 128]
|
||||
|
||||
+row
|
||||
+cell #[code embed_size]
|
||||
+cell
|
||||
+cell Number of rows in embedding tables.
|
||||
+cell #[code 7500]
|
||||
|
||||
+row
|
||||
+cell #[code parser_maxout_pieces]
|
||||
+cell
|
||||
+cell Number of pieces in the parser's and NER's first maxout layer.
|
||||
+cell #[code 2]
|
||||
|
||||
+row
|
||||
+cell #[code parser_hidden_depth]
|
||||
+cell
|
||||
+cell Number of hidden layers in the parser and NER.
|
||||
+cell #[code 1]
|
||||
|
||||
+row
|
||||
+cell #[code hidden_width]
|
||||
+cell
|
||||
+cell Size of the parser's and NER's hidden layers.
|
||||
+cell #[code 128]
|
||||
|
||||
+row
|
||||
+cell #[code learn_rate]
|
||||
+cell
|
||||
+cell Learning rate.
|
||||
+cell #[code 0.001]
|
||||
|
||||
+row
|
||||
+cell #[code optimizer_B1]
|
||||
+cell
|
||||
+cell Momentum for the Adam solver.
|
||||
+cell #[code 0.9]
|
||||
|
||||
+row
|
||||
+cell #[code optimizer_B2]
|
||||
+cell
|
||||
+cell Adagrad-momentum for the Adam solver.
|
||||
+cell #[code 0.999]
|
||||
|
||||
+row
|
||||
+cell #[code optimizer_eps]
|
||||
+cell
|
||||
+cell Epsilon value for the Adam solver.
|
||||
+cell #[code 1e-08]
|
||||
|
||||
+row
|
||||
+cell #[code L2_penalty]
|
||||
+cell
|
||||
+cell L2 regularisation penalty.
|
||||
+cell #[code 1e-06]
|
||||
|
||||
+row
|
||||
+cell #[code grad_norm_clip]
|
||||
+cell
|
||||
+cell Gradient L2 norm constraint.
|
||||
+cell #[code 1.0]
|
||||
|
||||
+h(2, "package") Package
|
||||
+h(3, "package") Package
|
||||
|
||||
p
|
||||
| Generate a #[+a("/docs/usage/saving-loading#generating") model Python package]
|
||||
| Generate a #[+a("/usage/training#models-generating") model Python package]
|
||||
| from an existing model data directory. All data files are copied over.
|
||||
| If the path to a meta.json is supplied, or a meta.json is found in the
|
||||
| input directory, this file is used. Otherwise, the data can be entered
|
||||
|
@ -336,8 +356,8 @@ p
|
|||
| sure you're always using the latest versions. This means you need to be
|
||||
| connected to the internet to use this command.
|
||||
|
||||
+code(false, "bash", "$").
|
||||
spacy package [input_dir] [output_dir] [--meta] [--force]
|
||||
+code(false, "bash", "$", false, false, true).
|
||||
spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force]
|
||||
|
||||
+table(["Argument", "Type", "Description"])
|
||||
+row
|
||||
|
@ -353,14 +373,14 @@ p
|
|||
+row
|
||||
+cell #[code --meta-path], #[code -m]
|
||||
+cell option
|
||||
+cell Path to meta.json file (optional).
|
||||
+cell #[+tag-new(2)] Path to meta.json file (optional).
|
||||
|
||||
+row
|
||||
+cell #[code --create-meta], #[code -c]
|
||||
+cell flag
|
||||
+cell
|
||||
| Create a meta.json file on the command line, even if one already
|
||||
| exists in the directory.
|
||||
| #[+tag-new(2)] Create a meta.json file on the command line, even
|
||||
| if one already exists in the directory.
|
||||
|
||||
+row
|
||||
+cell #[code --force], #[code -f]
|
website/api/_top-level/_compat.jade (new file, 91 lines)
@ -0,0 +1,91 @@
|
|||
//- 💫 DOCS > API > TOP-LEVEL > COMPATIBILITY
|
||||
|
||||
p
|
||||
| All Python code is written in an
|
||||
| #[strong intersection of Python 2 and Python 3]. This is easy in Cython,
|
||||
| but somewhat ugly in Python. Logic that deals with Python or platform
|
||||
| compatibility only lives in #[code spacy.compat]. To distinguish them from
|
||||
| the builtin functions, replacement functions are suffixed with an
|
||||
| underscore, e.g. #[code unicode_]. For specific checks, spaCy uses the
|
||||
| #[code six] and #[code ftfy] packages.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.compat import unicode_, json_dumps
|
||||
|
||||
compatible_unicode = unicode_('hello world')
|
||||
compatible_json = json_dumps({'key': 'value'})
|
||||
|
||||
+table(["Name", "Python 2", "Python 3"])
|
||||
+row
|
||||
+cell #[code compat.bytes_]
|
||||
+cell #[code str]
|
||||
+cell #[code bytes]
|
||||
|
||||
+row
|
||||
+cell #[code compat.unicode_]
|
||||
+cell #[code unicode]
|
||||
+cell #[code str]
|
||||
|
||||
+row
|
||||
+cell #[code compat.basestring_]
|
||||
+cell #[code basestring]
|
||||
+cell #[code str]
|
||||
|
||||
+row
|
||||
+cell #[code compat.input_]
|
||||
+cell #[code raw_input]
|
||||
+cell #[code input]
|
||||
|
||||
+row
|
||||
+cell #[code compat.json_dumps]
|
||||
+cell #[code ujson.dumps] with #[code .decode('utf8')]
|
||||
+cell #[code ujson.dumps]
|
||||
|
||||
+row
|
||||
+cell #[code compat.path2str]
|
||||
+cell #[code str(path)] with #[code .decode('utf8')]
|
||||
+cell #[code str(path)]
|
||||
|
||||
+h(3, "is_config") compat.is_config
|
||||
+tag function
|
||||
|
||||
p
|
||||
| Check if a specific configuration of Python version and operating system
|
||||
| matches the user's setup. Mostly used to display targeted error messages.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.compat import is_config
|
||||
|
||||
if is_config(python2=True, windows=True):
|
||||
print("You are using Python 2 on Windows.")
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code python2]
|
||||
+cell bool
|
||||
+cell spaCy is executed with Python 2.x.
|
||||
|
||||
+row
|
||||
+cell #[code python3]
|
||||
+cell bool
|
||||
+cell spaCy is executed with Python 3.x.
|
||||
|
||||
+row
|
||||
+cell #[code windows]
|
||||
+cell bool
|
||||
+cell spaCy is executed on Windows.
|
||||
|
||||
+row
|
||||
+cell #[code linux]
|
||||
+cell bool
|
||||
+cell spaCy is executed on Linux.
|
||||
|
||||
+row
|
||||
+cell #[code osx]
|
||||
+cell bool
|
||||
+cell spaCy is executed on OS X or macOS.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the specified configuration matches the user's platform.
|
|
@ -1,14 +1,12 @@
|
|||
//- 💫 DOCS > API > DISPLACY
|
||||
|
||||
include ../../_includes/_mixins
|
||||
//- 💫 DOCS > API > TOP-LEVEL > DISPLACY
|
||||
|
||||
p
|
||||
| As of v2.0, spaCy comes with a built-in visualization suite. For more
|
||||
| info and examples, see the usage guide on
|
||||
| #[+a("/docs/usage/visualizers") visualizing spaCy].
|
||||
| #[+a("/usage/visualizers") visualizing spaCy].
|
||||
|
||||
|
||||
+h(2, "serve") displacy.serve
|
||||
+h(3, "displacy.serve") displacy.serve
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -60,7 +58,7 @@ p
|
|||
+cell bool
|
||||
+cell
|
||||
| Don't parse #[code Doc] and instead, expect a dict or list of
|
||||
| dicts. #[+a("/docs/usage/visualizers#manual-usage") See here]
|
||||
| dicts. #[+a("/usage/visualizers#manual-usage") See here]
|
||||
| for formats and examples.
|
||||
+cell #[code False]
|
||||
|
||||
|
@ -70,7 +68,7 @@ p
|
|||
+cell Port to serve visualization.
|
||||
+cell #[code 5000]
|
||||
|
||||
+h(2, "render") displacy.render
|
||||
+h(3, "displacy.render") displacy.render
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -127,24 +125,24 @@ p Render a dependency parse tree or named entity visualization.
|
|||
+cell bool
|
||||
+cell
|
||||
| Don't parse #[code Doc] and instead, expect a dict or list of
|
||||
| dicts. #[+a("/docs/usage/visualizers#manual-usage") See here]
|
||||
| dicts. #[+a("/usage/visualizers#manual-usage") See here]
|
||||
| for formats and examples.
|
||||
+cell #[code False]
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell unicode
|
||||
+cell Rendered HTML markup.
|
||||
+cell
|
||||
|
||||
+h(2, "options") Visualizer options
|
||||
+h(3, "displacy_options") Visualizer options
|
||||
|
||||
p
|
||||
| The #[code options] argument lets you specify additional settings for
|
||||
| each visualizer. If a setting is not present in the options, the default
|
||||
| value will be used.
|
||||
|
||||
+h(3, "options-dep") Dependency Visualizer options
|
||||
+h(4, "options-dep") Dependency Visualizer options
|
||||
|
||||
+aside-code("Example").
|
||||
options = {'compact': True, 'color': 'blue'}
|
||||
|
@ -219,7 +217,7 @@ p
|
|||
+cell Distance between words in px.
|
||||
+cell #[code 175] / #[code 85] (compact)
|
||||
|
||||
+h(3, "options-ent") Named Entity Visualizer options
|
||||
+h(4, "displacy_options-ent") Named Entity Visualizer options
|
||||
|
||||
+aside-code("Example").
|
||||
options = {'ents': ['PERSON', 'ORG', 'PRODUCT'],
|
||||
|
@ -244,6 +242,6 @@ p
|
|||
|
||||
p
|
||||
| By default, displaCy comes with colours for all
|
||||
| #[+a("/docs/api/annotation#named-entities") entity types supported by spaCy].
|
||||
| #[+a("/api/annotation#named-entities") entity types supported by spaCy].
|
||||
| If you're using custom entity types, you can use the #[code colors]
|
||||
| setting to add your own colours for them.
|
|
@ -1,15 +1,13 @@
|
|||
//- 💫 DOCS > API > SPACY
|
||||
//- 💫 DOCS > API > TOP-LEVEL > SPACY
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
+h(2, "load") spacy.load
|
||||
+h(3, "spacy.load") spacy.load
|
||||
+tag function
|
||||
+tag-model
|
||||
|
||||
p
|
||||
| Load a model via its #[+a("/docs/usage/models#usage") shortcut link],
|
||||
| Load a model via its #[+a("/usage/models#usage") shortcut link],
|
||||
| the name of an installed
|
||||
| #[+a("/docs/usage/saving-loading#generating") model package], a unicode
|
||||
| #[+a("/usage/training#models-generating") model package], a unicode
|
||||
| path or a #[code Path]-like object. spaCy will try resolving the load
|
||||
| argument in this order. If a model is loaded from a shortcut link or
|
||||
| package name, spaCy will assume it's a Python package and import it and
|
||||
|
@ -38,25 +36,57 @@ p
|
|||
+cell list
|
||||
+cell
|
||||
| Names of pipeline components to
|
||||
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
|
||||
| #[+a("/usage/processing-pipelines#disabling") disable].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell A #[code Language] object with the loaded model.
|
||||
|
||||
+infobox("⚠️ Deprecation note")
|
||||
+infobox("Deprecation note", "⚠️")
|
||||
.o-block
|
||||
| As of spaCy 2.0, the #[code path] keyword argument is deprecated. spaCy
|
||||
| will also raise an error if no model could be loaded and never just
|
||||
| return an empty #[code Language] object. If you need a blank language,
|
||||
| you need to import it explicitly (#[code from spacy.lang.en import English])
|
||||
| or use #[+api("util#get_lang_class") #[code util.get_lang_class]].
|
||||
| you can use the new function #[+api("spacy#blank") #[code spacy.blank()]]
|
||||
| or import the class explicitly, e.g.
|
||||
| #[code from spacy.lang.en import English].
|
||||
|
||||
+code-new nlp = spacy.load('/model')
|
||||
+code-old nlp = spacy.load('en', path='/model')
|
||||
|
||||
+h(2, "info") spacy.info
|
||||
+h(3, "spacy.blank") spacy.blank
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Create a blank model of a given language class. This function is the
|
||||
| twin of #[code spacy.load()].
|
||||
|
||||
+aside-code("Example").
|
||||
nlp_en = spacy.blank('en')
|
||||
nlp_de = spacy.blank('de')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code name]
|
||||
+cell unicode
|
||||
+cell ISO code of the language class to load.
|
||||
|
||||
+row
|
||||
+cell #[code disable]
|
||||
+cell list
|
||||
+cell
|
||||
| Names of pipeline components to
|
||||
| #[+a("/usage/processing-pipelines#disabling") disable].
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell An empty #[code Language] object of the appropriate subclass.
|
||||
|
||||
|
||||
+h(4, "spacy.info") spacy.info
|
||||
+tag function
|
||||
|
||||
p
|
||||
|
@ -83,13 +113,13 @@ p
|
|||
+cell Print information as Markdown.
|
||||
|
||||
|
||||
+h(2, "explain") spacy.explain
|
||||
+h(3, "spacy.explain") spacy.explain
|
||||
+tag function
|
||||
|
||||
p
|
||||
| Get a description for a given POS tag, dependency label or entity type.
|
||||
| For a list of available terms, see
|
||||
| #[+src(gh("spacy", "spacy/glossary.py")) glossary.py].
|
||||
| #[+src(gh("spacy", "spacy/glossary.py")) #[code glossary.py]].
|
||||
|
||||
+aside-code("Example").
|
||||
spacy.explain('NORP')
|
||||
|
@ -107,18 +137,18 @@ p
|
|||
+cell unicode
|
||||
+cell Term to explain.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell unicode
|
||||
+cell The explanation, or #[code None] if not found in the glossary.
|
||||
|
||||
+h(2, "set_factory") spacy.set_factory
|
||||
+h(3, "spacy.set_factory") spacy.set_factory
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Set a factory that returns a custom
|
||||
| #[+a("/docs/usage/language-processing-pipeline") processing pipeline]
|
||||
| #[+a("/usage/processing-pipelines") processing pipeline]
|
||||
| component. Factories are useful for creating stateful components, especially ones which depend on shared data.
|
||||
|
||||
+aside-code("Example").
|
|
@ -1,10 +1,8 @@
|
|||
//- 💫 DOCS > API > UTIL
|
||||
|
||||
include ../../_includes/_mixins
|
||||
//- 💫 DOCS > API > TOP-LEVEL > UTIL
|
||||
|
||||
p
|
||||
| spaCy comes with a small collection of utility functions located in
|
||||
| #[+src(gh("spaCy", "spacy/util.py")) spacy/util.py].
|
||||
| #[+src(gh("spaCy", "spacy/util.py")) #[code spacy/util.py]].
|
||||
| Because utility functions are mostly intended for
|
||||
| #[strong internal use within spaCy], their behaviour may change with
|
||||
| future releases. The functions documented on this page should be safe
|
||||
|
@ -12,7 +10,7 @@ p
|
|||
| recommend having additional tests in place if your application depends on
|
||||
| any of spaCy's utilities.
|
||||
|
||||
+h(2, "get_data_path") util.get_data_path
|
||||
+h(3, "util.get_data_path") util.get_data_path
|
||||
+tag function
|
||||
|
||||
p
|
||||
|
@ -25,12 +23,12 @@ p
|
|||
+cell bool
|
||||
+cell Only return path if it exists, otherwise return #[code None].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Path] / #[code None]
|
||||
+cell Data path or #[code None].
|
||||
|
||||
+h(2, "set_data_path") util.set_data_path
|
||||
+h(3, "util.set_data_path") util.set_data_path
|
||||
+tag function
|
||||
|
||||
p
|
||||
|
@ -47,12 +45,12 @@ p
|
|||
+cell unicode or #[code Path]
|
||||
+cell Path to new data directory.
|
||||
|
||||
+h(2, "get_lang_class") util.get_lang_class
|
||||
+h(3, "util.get_lang_class") util.get_lang_class
|
||||
+tag function
|
||||
|
||||
p
|
||||
| Import and load a #[code Language] class. Allows lazy-loading
|
||||
| #[+a("/docs/usage/adding-languages") language data] and importing
|
||||
| #[+a("/usage/adding-languages") language data] and importing
|
||||
| languages using the two-letter language code.
|
||||
|
||||
+aside-code("Example").
|
||||
|
@ -67,12 +65,12 @@ p
|
|||
+cell unicode
|
||||
+cell Two-letter language code, e.g. #[code 'en'].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell Language class.
|
||||
|
||||
+h(2, "load_model") util.load_model
|
||||
+h(3, "util.load_model") util.load_model
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -101,12 +99,12 @@ p
|
|||
+cell -
|
||||
+cell Specific overrides, like pipeline components to disable.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell #[code Language] class with the loaded model.
|
||||
|
||||
+h(2, "load_model_from_path") util.load_model_from_path
|
||||
+h(3, "util.load_model_from_path") util.load_model_from_path
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -139,18 +137,18 @@ p
|
|||
+cell -
|
||||
+cell Specific overrides, like pipeline components to disable.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell #[code Language] class with the loaded model.
|
||||
|
||||
+h(2, "load_model_from_init_py") util.load_model_from_init_py
|
||||
+h(3, "util.load_model_from_init_py") util.load_model_from_init_py
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| A helper function to use in the #[code load()] method of a model package's
|
||||
| #[+src(gh("spacy-dev-resources", "templates/model/en_model_name/__init__.py")) __init__.py].
|
||||
| #[+src(gh("spacy-dev-resources", "templates/model/en_model_name/__init__.py")) #[code __init__.py]].
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.util import load_model_from_init_py
|
||||
|
@ -169,12 +167,12 @@ p
|
|||
+cell -
|
||||
+cell Specific overrides, like pipeline components to disable.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell #[code Language] class with the loaded model.
|
||||
|
||||
+h(2, "get_model_meta") util.get_model_meta
|
||||
+h(3, "util.get_model_meta") util.get_model_meta
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -190,17 +188,17 @@ p
|
|||
+cell unicode or #[code Path]
|
||||
+cell Path to model directory.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell dict
|
||||
+cell The model's meta data.
|
||||
|
||||
+h(2, "is_package") util.is_package
|
||||
+h(3, "util.is_package") util.is_package
|
||||
+tag function
|
||||
|
||||
p
|
||||
| Check if string maps to a package installed via pip. Mainly used to
|
||||
| validate #[+a("/docs/usage/models") model packages].
|
||||
| validate #[+a("/usage/models") model packages].
|
||||
|
||||
+aside-code("Example").
|
||||
util.is_package('en_core_web_sm') # True
|
||||
|
@ -212,18 +210,18 @@ p
|
|||
+cell unicode
|
||||
+cell Name of package.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code bool]
|
||||
+cell #[code True] if installed package, #[code False] if not.
|
||||
|
||||
+h(2, "get_package_path") util.get_package_path
|
||||
+h(3, "util.get_package_path") util.get_package_path
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Get path to an installed package. Mainly used to resolve the location of
|
||||
| #[+a("/docs/usage/models") model packages]. Currently imports the package
|
||||
| #[+a("/usage/models") model packages]. Currently imports the package
|
||||
| to find its path.
|
||||
|
||||
+aside-code("Example").
|
||||
|
@ -236,12 +234,12 @@ p
|
|||
+cell unicode
|
||||
+cell Name of installed package.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Path]
|
||||
+cell Path to model package directory.
|
||||
|
||||
+h(2, "is_in_jupyter") util.is_in_jupyter
|
||||
+h(3, "util.is_in_jupyter") util.is_in_jupyter
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -257,17 +255,17 @@ p
|
|||
return display(HTML(html))
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell #[code True] if in Jupyter, #[code False] if not.
|
||||
|
||||
+h(2, "update_exc") util.update_exc
|
||||
+h(3, "util.update_exc") util.update_exc
|
||||
+tag function
|
||||
|
||||
p
|
||||
| Update, validate and overwrite
|
||||
| #[+a("/docs/usage/adding-languages#tokenizer-exceptions") tokenizer exceptions].
|
||||
| #[+a("/usage/adding-languages#tokenizer-exceptions") tokenizer exceptions].
|
||||
| Used to combine global exceptions with custom, language-specific
|
||||
| exceptions. Will raise an error if key doesn't match #[code ORTH] values.
|
||||
|
||||
|
@ -288,20 +286,20 @@ p
|
|||
+cell dicts
|
||||
+cell Exception dictionaries to add to the base exceptions, in order.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell dict
|
||||
+cell Combined tokenizer exceptions.
|
||||
|
||||
|
||||
+h(2, "prints") util.prints
|
||||
+h(3, "util.prints") util.prints
|
||||
+tag function
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Print a formatted, text-wrapped message with optional title. If a text
|
||||
| argument is a #[code Path], it's converted to a string. Should only
|
||||
| be used for interactive components like the #[+api("cli") cli].
|
||||
| be used for interactive components like the command-line interface.
|
||||
|
||||
+aside-code("Example").
|
||||
data_path = Path('/some/path')
|
website/api/annotation.jade (new file, 131 lines)
@ -0,0 +1,131 @@
|
|||
//- 💫 DOCS > API > ANNOTATION SPECS
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
p This document describes the target annotations spaCy is trained to predict.
|
||||
|
||||
|
||||
+section("tokenization")
|
||||
+h(2, "tokenization") Tokenization
|
||||
|
||||
p
|
||||
| Tokenization standards are based on the
|
||||
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
|
||||
| The tokenizer differs from most by including tokens for significant
|
||||
| whitespace. Any sequence of whitespace characters beyond a single space
|
||||
| (#[code ' ']) is included as a token.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.lang.en import English
|
||||
nlp = English()
|
||||
tokens = nlp('Some\nspaces and\ttab characters')
|
||||
tokens_text = [t.text for t in tokens]
|
||||
assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and',
|
||||
'\t', 'tab', 'characters']
|
||||
|
||||
p
|
||||
| The whitespace tokens are useful for much the same reason punctuation is
|
||||
| – it's often an important delimiter in the text. By preserving it in the
|
||||
| token output, we are able to maintain a simple alignment between the
|
||||
| tokens and the original string, and we ensure that no information is
|
||||
| lost during processing.
|
||||
|
||||
+section("sbd")
|
||||
+h(2, "sentence-boundary") Sentence boundary detection
|
||||
|
||||
p
    | Sentence boundaries are calculated from the syntactic parse tree, so
    | features such as punctuation and capitalisation play an important but
    | non-decisive role in determining the sentence boundaries. Usually this
    | means that the sentence boundaries will at least coincide with clause
    | boundaries, even given poorly punctuated text.
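p
    | For illustration (this snippet assumes the #[code en_core_web_sm]
    | model is installed, so a syntactic parse is available), sentence
    | boundaries are exposed via #[code Doc.sents]:

+code.
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'Peter told me he lives in Berlin he likes it there')
    # even without punctuation, the parse will usually split at the
    # clause boundary
    sentences = [sent.text for sent in doc.sents]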
+section("pos-tagging")
|
||||
+h(2, "pos-tagging") Part-of-speech Tagging
|
||||
|
||||
+aside("Tip: Understanding tags")
|
||||
| You can also use #[code spacy.explain()] to get the description for the
|
||||
| string representation of a tag. For example,
|
||||
| #[code spacy.explain("RB")] will return "adverb".
|
||||
|
||||
include _annotation/_pos-tags
|
||||
|
||||
+section("lemmatization")
|
||||
+h(2, "lemmatization") Lemmatization
|
||||
|
||||
p A "lemma" is the uninflected form of a word. In English, this means:
|
||||
|
||||
+list
|
||||
+item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
|
||||
+item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
|
||||
+item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
|
||||
+item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
|
||||
|
||||
p
    | The lemmatization data is taken from
    | #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
    | special case for pronouns: all pronouns are lemmatized to the special
    | token #[code -PRON-].

+infobox("About spaCy's custom pronoun lemma")
    | Unlike verbs and common nouns, there's no clear base form of a personal
    | pronoun. Should the lemma of "me" be "I", or should we normalize person
    | as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
    | novel symbol, #[code -PRON-], which is used as the lemma for
    | all personal pronouns.
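p
    | A minimal sketch of this behaviour (it assumes a model with a tagger
    | is loaded, so lemmas are assigned):

+code.
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'I was reading the paper')
    lemmas = [token.lemma_ for token in doc]
    # the pronoun "I" gets the special lemma '-PRON-', while "was" and
    # "reading" are lemmatized to "be" and "read"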
+section("dependency-parsing")
|
||||
+h(2, "dependency-parsing") Syntactic Dependency Parsing
|
||||
|
||||
+aside("Tip: Understanding labels")
|
||||
| You can also use #[code spacy.explain()] to get the description for the
|
||||
| string representation of a label. For example,
|
||||
| #[code spacy.explain("prt")] will return "particle".
|
||||
|
||||
include _annotation/_dep-labels
|
||||
|
||||
+section("named-entities")
|
||||
+h(2, "named-entities") Named Entity Recognition
|
||||
|
||||
+aside("Tip: Understanding entity types")
|
||||
| You can also use #[code spacy.explain()] to get the description for the
|
||||
| string representation of an entity label. For example,
|
||||
| #[code spacy.explain("LANGUAGE")] will return "any named language".
|
||||
|
||||
include _annotation/_named-entities
|
||||
|
||||
+h(3, "biluo") BILUO Scheme
|
||||
|
||||
include _annotation/_biluo
|
||||
|
||||
+section("training")
|
||||
+h(2, "json-input") JSON input format for training
|
||||
|
||||
+under-construction
|
||||
|
||||
p spaCy takes training data in the following format:
|
||||
|
||||
+code("Example structure").
|
||||
doc: {
|
||||
id: string,
|
||||
paragraphs: [{
|
||||
raw: string,
|
||||
sents: [int],
|
||||
tokens: [{
|
||||
start: int,
|
||||
tag: string,
|
||||
head: int,
|
||||
dep: string
|
||||
}],
|
||||
ner: [{
|
||||
start: int,
|
||||
end: int,
|
||||
label: string
|
||||
}],
|
||||
brackets: [{
|
||||
start: int,
|
||||
end: int,
|
||||
label: string
|
||||
}]
|
||||
}]
|
||||
}
|
|
@ -1,6 +1,6 @@
|
|||
//- 💫 DOCS > API > BINDER
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p A container class for serializing collections of #[code Doc] objects.
|
||||
|
website/api/dependencyparser.jade (new file, 5 lines)
@ -0,0 +1,5 @@
|
|||
//- 💫 DOCS > API > DEPENDENCYPARSER
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
!=partial("pipe", { subclass: "DependencyParser", short: "parser", pipeline_id: "parser" })
|
|
@ -1,8 +1,6 @@
|
|||
//- 💫 DOCS > API > DOC
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p A container for accessing linguistic annotations.
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| A #[code Doc] is a sequence of #[+api("token") #[code Token]] objects.
|
||||
|
@ -47,7 +45,7 @@ p
|
|||
| subsequent space. Must have the same length as #[code words], if
|
||||
| specified. Defaults to a sequence of #[code True].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Doc]
|
||||
+cell The newly constructed object.
|
||||
|
@ -73,7 +71,7 @@ p
|
|||
+cell int
|
||||
+cell The index of the token.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Token]
|
||||
+cell The token at #[code doc[i]].
|
||||
|
@ -96,7 +94,7 @@ p
|
|||
+cell tuple
|
||||
+cell The slice of the document to get.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Span]
|
||||
+cell The span at #[code doc[start : end]].
|
||||
|
@ -120,7 +118,7 @@ p
|
|||
| from Cython.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell A #[code Token] object.
|
||||
|
@ -135,7 +133,7 @@ p Get the number of tokens in the document.
|
|||
assert len(doc) == 7
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of tokens in the document.
|
||||
|
@ -172,7 +170,7 @@ p Create a #[code Span] object from the slice #[code doc.text[start : end]].
|
|||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell A meaning representation of the span.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Span]
|
||||
+cell The newly constructed object.
|
||||
|
@ -200,7 +198,7 @@ p
|
|||
| The object to compare with. By default, accepts #[code Doc],
|
||||
| #[code Span], #[code Token] and #[code Lexeme] objects.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell float
|
||||
+cell A scalar similarity score. Higher is more similar.
|
||||
|
@ -226,7 +224,7 @@ p
|
|||
+cell int
|
||||
+cell The attribute ID
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell dict
|
||||
+cell A dictionary mapping attributes to integer counts.
|
||||
|
@ -251,7 +249,7 @@ p
|
|||
+cell list
|
||||
+cell A list of attribute ID ints.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[ndim=2, dtype='int32']]
|
||||
+cell
|
||||
|
@ -285,7 +283,7 @@ p
|
|||
+cell #[code.u-break numpy.ndarray[ndim=2, dtype='int32']]
|
||||
+cell The attribute values to load.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Doc]
|
||||
+cell Itself.
|
||||
|
@ -326,7 +324,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
|
|||
| A path to a directory. Paths may be either strings or
|
||||
| #[code Path]-like objects.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Doc]
|
||||
+cell The modified #[code Doc] object.
|
||||
|
@ -341,7 +339,7 @@ p Serialize, i.e. export the document contents to a binary string.
|
|||
doc_bytes = doc.to_bytes()
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bytes
|
||||
+cell
|
||||
|
@ -367,7 +365,7 @@ p Deserialize, i.e. import the document contents from a binary string.
|
|||
+cell bytes
|
||||
+cell The string to load from.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Doc]
|
||||
+cell The #[code Doc] object.
|
||||
|
@ -378,7 +376,7 @@ p Deserialize, i.e. import the document contents from a binary string.
|
|||
p
|
||||
| Retokenize the document, such that the span at
|
||||
| #[code doc.text[start_idx : end_idx]] is merged into a single token. If
|
||||
| #[code start_idx] and #[end_idx] do not mark start and end token
|
||||
| #[code start_idx] and #[code end_idx] do not mark start and end token
|
||||
| boundaries, the document remains unchanged.
|
||||
|
||||
+aside-code("Example").
|
||||
|
@ -405,7 +403,7 @@ p
|
|||
| attributes are inherited from the syntactic root token of
|
||||
| the span.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Token]
|
||||
+cell
|
||||
|
@ -440,7 +438,7 @@ p
|
|||
+cell bool
|
||||
+cell Don't include arcs or modifiers.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell dict
|
||||
+cell Parse tree as dict.
|
||||
|
@ -462,7 +460,7 @@ p
|
|||
assert ents[0].text == 'Mr. Best'
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Span]
|
||||
+cell Entities in the document.
|
||||
|
@ -485,7 +483,7 @@ p
|
|||
assert chunks[1].text == "another phrase"
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Span]
|
||||
+cell Noun chunks in the document.
|
||||
|
@ -507,7 +505,7 @@ p
|
|||
assert [s.root.text for s in sents] == ["is", "'s"]
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Span]
|
||||
+cell Sentences in the document.
|
||||
|
@ -525,7 +523,7 @@ p
|
|||
assert doc.has_vector
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the document has a vector data attached.
|
||||
|
@ -544,7 +542,7 @@ p
|
|||
assert doc.vector.shape == (300,)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell A 1D numpy array representing the document's semantics.
|
||||
|
@ -564,7 +562,7 @@ p
|
|||
assert doc1.vector_norm != doc2.vector_norm
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell float
|
||||
+cell The L2 norm of the vector representation.
|
5
website/api/entityrecognizer.jade
Normal file
|
@ -0,0 +1,5 @@
|
|||
//- 💫 DOCS > API > ENTITYRECOGNIZER
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
!=partial("pipe", { subclass: "EntityRecognizer", short: "ner", pipeline_id: "ner" })
|
|
@ -1,14 +1,12 @@
|
|||
//- 💫 DOCS > API > GOLDCORPUS
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| An annotated corpus, using the JSON file format. Manages annotations for
|
||||
| tagging, dependency parsing and NER.
|
||||
| This class manages annotations for tagging, dependency parsing and NER.
|
||||
|
||||
+h(2, "init") GoldCorpus.__init__
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
p Create a #[code GoldCorpus].
|
||||
|
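p
    | A minimal usage sketch. It assumes the constructor accepts paths to
    | training and development data in the JSON format; check the parameter
    | documentation for the exact signature.

+aside-code("Example (sketch)").
    from spacy.gold import GoldCorpus
    # hypothetical paths to JSON-formatted corpora
    corpus = GoldCorpus('/path/to/train.json', '/path/to/dev.json')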
|
@ -1,6 +1,6 @@
|
|||
//- 💫 DOCS > API > GOLDPARSE
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p Collection for training annotations.
|
||||
|
||||
|
@ -40,7 +40,7 @@ p Create a #[code GoldParse].
|
|||
+cell iterable
|
||||
+cell A sequence of named entity annotations, either as BILUO tag strings, or as #[code (start_char, end_char, label)] tuples, representing the entity positions.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code GoldParse]
|
||||
+cell The newly constructed object.
|
||||
|
@ -51,7 +51,7 @@ p Create a #[code GoldParse].
|
|||
p Get the number of gold-standard tokens.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of gold-standard tokens.
|
||||
|
@ -64,7 +64,7 @@ p
|
|||
| tree.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether annotations form projective tree.
|
||||
|
@ -119,7 +119,7 @@ p
|
|||
|
||||
p
|
||||
| Encode labelled spans into per-token tags, using the
|
||||
| #[+a("/docs/api/annotation#biluo") BILUO scheme] (Begin/In/Last/Unit/Out).
|
||||
| #[+a("/api/annotation#biluo") BILUO scheme] (Begin/In/Last/Unit/Out).
|
||||
|
||||
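p
    | A short usage sketch, assuming a loaded #[code nlp] object; the
    | character offsets and the #[code 'LOC'] label are illustrative.

+aside-code("Example (sketch)").
    from spacy.gold import biluo_tags_from_offsets
    doc = nlp(u"I like London.")
    entities = [(7, 13, 'LOC')]
    tags = biluo_tags_from_offsets(doc, entities)
    assert tags == ['O', 'O', 'U-LOC', 'O']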
p
|
||||
| Returns a list of unicode strings, describing the tags. Each tag string
|
||||
|
@ -157,11 +157,11 @@ p
|
|||
| and #[code end] should be character-offset integers denoting the
|
||||
| slice into the original string.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell list
|
||||
+cell
|
||||
| Unicode strings, describing the
|
||||
| #[+a("/docs/api/annotation#biluo") BILUO] tags.
|
||||
| #[+a("/api/annotation#biluo") BILUO] tags.
|
||||
|
||||
|
14
website/api/index.jade
Normal file
|
@ -0,0 +1,14 @@
|
|||
//- 💫 DOCS > API > ARCHITECTURE
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
+section("basics")
|
||||
include ../usage/_spacy-101/_architecture
|
||||
|
||||
+section("nn-model")
|
||||
+h(2, "nn-model") Neural network model architecture
|
||||
include _architecture/_nn-model
|
||||
|
||||
+section("cython")
|
||||
+h(2, "cython") Cython conventions
|
||||
include _architecture/_cython
|
|
@ -1,10 +1,10 @@
|
|||
//- 💫 DOCS > API > LANGUAGE
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| A text-processing pipeline. Usually you'll load this once per process,
|
||||
| and pass the instance around your application.
|
||||
| Usually you'll load this once per process as #[code nlp] and pass the
|
||||
| instance around your application.
|
||||
|
||||
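p
    | A minimal sketch of that pattern, assuming the #[code 'en'] model is
    | installed:

+aside-code("Example (sketch)").
    import spacy
    nlp = spacy.load('en')               # nlp is a Language instance
    doc = nlp(u"This is a sentence.")
    for token in doc:
        print(token.text, token.pos_)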
+h(2, "init") Language.__init__
|
||||
+tag method
|
||||
|
@ -49,7 +49,7 @@ p Initialise a #[code Language] object.
|
|||
| Custom meta data for the #[code Language] class. Is written to by
|
||||
| models to add model meta data.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell The newly constructed object.
|
||||
|
@ -77,14 +77,14 @@ p
|
|||
+cell list
|
||||
+cell
|
||||
| Names of pipeline components to
|
||||
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
|
||||
| #[+a("/usage/processing-pipelines#disabling") disable].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Doc]
|
||||
+cell A container for accessing the annotations.
|
||||
|
||||
+infobox("⚠️ Deprecation note")
|
||||
+infobox("Deprecation note", "⚠️")
|
||||
.o-block
|
||||
| Pipeline components to prevent from being loaded can now be added as
|
||||
| a list to #[code disable], instead of specifying one keyword argument
|
||||
|
@ -136,9 +136,9 @@ p
|
|||
+cell list
|
||||
+cell
|
||||
| Names of pipeline components to
|
||||
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
|
||||
| #[+a("/usage/processing-pipelines#disabling") disable].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Doc]
|
||||
+cell Documents in the order of the original text.
|
||||
|
@ -175,7 +175,7 @@ p Update the models in the pipeline.
|
|||
+cell callable
|
||||
+cell An optimizer.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell dict
|
||||
+cell Results from the update.
|
||||
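p
    | A rough sketch of a single training pass. Here #[code train_data] is an
    | assumed list of #[code (text, entity_offsets)] pairs, and the optimizer
    | comes from #[code nlp.begin_training()].

+aside-code("Example (sketch)").
    from spacy.gold import GoldParse
    optimizer = nlp.begin_training()
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)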
|
@ -200,7 +200,7 @@ p
|
|||
+cell -
|
||||
+cell Config parameters.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell tuple
|
||||
+cell An optimizer.
|
||||
|
@ -242,7 +242,7 @@ p
|
|||
+cell iterable
|
||||
+cell Tuples of #[code Doc] and #[code GoldParse] objects.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell tuple
|
||||
+cell Tuples of #[code Doc] and #[code GoldParse] objects.
|
||||
|
@ -271,7 +271,7 @@ p
|
|||
+cell list
|
||||
+cell
|
||||
| Names of pipeline components to
|
||||
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable]
|
||||
| #[+a("/usage/processing-pipelines#disabling") disable]
|
||||
| and prevent from being saved.
|
||||
|
||||
+h(2, "from_disk") Language.from_disk
|
||||
|
@ -300,14 +300,14 @@ p
|
|||
+cell list
|
||||
+cell
|
||||
| Names of pipeline components to
|
||||
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
|
||||
| #[+a("/usage/processing-pipelines#disabling") disable].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell The modified #[code Language] object.
|
||||
|
||||
+infobox("⚠️ Deprecation note")
|
||||
+infobox("Deprecation note", "⚠️")
|
||||
.o-block
|
||||
| As of spaCy v2.0, the #[code save_to_directory] method has been
|
||||
| renamed to #[code to_disk], to improve consistency across classes.
|
||||
|
@ -332,10 +332,10 @@ p Serialize the current state to a binary string.
|
|||
+cell list
|
||||
+cell
|
||||
| Names of pipeline components to
|
||||
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable]
|
||||
| #[+a("/usage/processing-pipelines#disabling") disable]
|
||||
| and prevent from being serialized.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bytes
|
||||
+cell The serialized form of the #[code Language] object.
|
||||
|
@ -362,14 +362,14 @@ p Load state from a binary string.
|
|||
+cell list
|
||||
+cell
|
||||
| Names of pipeline components to
|
||||
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
|
||||
| #[+a("/usage/processing-pipelines#disabling") disable].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Language]
|
||||
+cell The #[code Language] object.
|
||||
|
||||
+infobox("⚠️ Deprecation note")
|
||||
+infobox("Deprecation note", "⚠️")
|
||||
.o-block
|
||||
| Pipeline components to prevent from being loaded can now be added as
|
||||
| a list to #[code disable], instead of specifying one keyword argument
|
5
website/api/lemmatizer.jade
Normal file
|
@ -0,0 +1,5 @@
|
|||
//- 💫 DOCS > API > LEMMATIZER
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
+under-construction
|
|
@ -1,6 +1,6 @@
|
|||
//- 💫 DOCS > API > LEXEME
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| An entry in the vocabulary. A #[code Lexeme] has no string context – it's
|
||||
|
@ -24,7 +24,7 @@ p Create a #[code Lexeme] object.
|
|||
+cell int
|
||||
+cell The orth id of the lexeme.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Lexeme]
|
||||
+cell The newly constructed object.
|
||||
|
@ -65,7 +65,7 @@ p Check the value of a boolean flag.
|
|||
+cell int
|
||||
+cell The attribute ID of the flag to query.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell The value of the flag.
|
||||
|
@ -91,7 +91,7 @@ p Compute a semantic similarity estimate. Defaults to cosine over vectors.
|
|||
| The object to compare with. By default, accepts #[code Doc],
|
||||
| #[code Span], #[code Token] and #[code Lexeme] objects.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell float
|
||||
+cell A scalar similarity score. Higher is more similar.
|
||||
|
@ -110,7 +110,7 @@ p
|
|||
assert apple.has_vector
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the lexeme has a vector data attached.
|
||||
|
@ -127,7 +127,7 @@ p A real-valued meaning representation.
|
|||
assert apple.vector.shape == (300,)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell A 1D numpy array representing the lexeme's semantics.
|
||||
|
@ -146,7 +146,7 @@ p The L2 norm of the lexeme's vector representation.
|
|||
assert apple.vector_norm != pasta.vector_norm
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell float
|
||||
+cell The L2 norm of the vector representation.
|
|
@ -1,10 +1,8 @@
|
|||
//- 💫 DOCS > API > MATCHER
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p Match sequences of tokens, based on pattern rules.
|
||||
|
||||
+infobox("⚠️ Deprecation note")
|
||||
+infobox("Deprecation note", "⚠️")
|
||||
| As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
|
||||
| are deprecated and have been replaced with a simpler
|
||||
| #[+api("matcher#add") #[code Matcher.add]] that lets you add a list of
|
||||
|
@ -39,7 +37,7 @@ p Create the rule-based #[code Matcher].
|
|||
+cell dict
|
||||
+cell Patterns to add to the matcher, keyed by ID.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Matcher]
|
||||
+cell The newly constructed object.
|
||||
|
@ -64,7 +62,7 @@ p Find all token sequences matching the supplied patterns on the #[code Doc].
|
|||
+cell #[code Doc]
|
||||
+cell The document to match over.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell list
|
||||
+cell
|
||||
|
@ -81,7 +79,7 @@ p Find all token sequences matching the supplied patterns on the #[code Doc].
|
|||
| actions per pattern within the same matcher. For example, you might only
|
||||
| want to merge some entity types, and set custom flags for other matched
|
||||
| patterns. For more details and examples, see the usage guide on
|
||||
| #[+a("/docs/usage/rule-based-matching") rule-based matching].
|
||||
| #[+a("/usage/linguistic-features#rule-based-matching") rule-based matching].
|
||||
|
||||
+h(2, "pipe") Matcher.pipe
|
||||
+tag method
|
||||
|
@ -113,7 +111,7 @@ p Match a stream of documents, yielding them in turn.
|
|||
| parallel, if the #[code Matcher] implementation supports
|
||||
| multi-threading.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Doc]
|
||||
+cell Documents, in order.
|
||||
|
@ -134,7 +132,7 @@ p
|
|||
assert len(matcher) == 1
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of rules.
|
||||
|
@ -156,7 +154,8 @@ p Check whether the matcher contains rules for a match ID.
|
|||
+cell #[code key]
|
||||
+cell unicode
|
||||
+cell The match ID.
|
||||
+footrow
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell Whether the matcher contains rules for this match ID.
|
||||
|
@ -203,7 +202,7 @@ p
|
|||
| Match pattern. A pattern consists of a list of dicts, where each
|
||||
| dict describes a token.
|
||||
|
||||
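p
    | For illustration, a pattern that matches "hello world" with optional
    | punctuation in between, assuming a loaded #[code nlp] object. The rule
    | name #[code 'HelloWorld'] is arbitrary.

+aside-code("Example (sketch)").
    from spacy.matcher import Matcher
    matcher = Matcher(nlp.vocab)
    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'world'}]
    matcher.add('HelloWorld', None, pattern)
    matches = matcher(nlp(u"Hello, world!"))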
+infobox("⚠️ Deprecation note")
|
||||
+infobox("Deprecation note", "⚠️")
|
||||
.o-block
|
||||
| As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
|
||||
| are deprecated and have been replaced with a simpler
|
||||
|
@ -257,7 +256,7 @@ p
|
|||
+cell unicode
|
||||
+cell The ID of the match rule.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell tuple
|
||||
+cell The rule, as an #[code (on_match, patterns)] tuple.
|
181
website/api/phrasematcher.jade
Normal file
|
@ -0,0 +1,181 @@
|
|||
//- 💫 DOCS > API > PHRASEMATCHER
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| The #[code PhraseMatcher] lets you efficiently match large terminology
|
||||
| lists. While the #[+api("matcher") #[code Matcher]] lets you match
|
||||
| sequences based on lists of token descriptions, the #[code PhraseMatcher]
|
||||
| accepts match patterns in the form of #[code Doc] objects.
|
||||
|
||||
+h(2, "init") PhraseMatcher.__init__
|
||||
+tag method
|
||||
|
||||
p Create the rule-based #[code PhraseMatcher].
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.matcher import PhraseMatcher
|
||||
matcher = PhraseMatcher(nlp.vocab, max_length=6)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code vocab]
|
||||
+cell #[code Vocab]
|
||||
+cell
|
||||
| The vocabulary object, which must be shared with the documents
|
||||
| the matcher will operate on.
|
||||
|
||||
+row
|
||||
+cell #[code max_length]
|
||||
+cell int
|
||||
+cell Maximum length of a phrase pattern to add.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code PhraseMatcher]
|
||||
+cell The newly constructed object.
|
||||
|
||||
+h(2, "call") PhraseMatcher.__call__
|
||||
+tag method
|
||||
|
||||
p Find all token sequences matching the supplied patterns on the #[code Doc].
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.matcher import PhraseMatcher
|
||||
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
matcher.add('OBAMA', None, nlp(u"Barack Obama"))
|
||||
doc = nlp(u"Barack Obama lifts America one last time in emotional farewell")
|
||||
matches = matcher(doc)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The document to match over.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell list
|
||||
+cell
|
||||
| A list of #[code (match_id, start, end)] tuples, describing the
|
||||
| matches. A match tuple describes a span #[code doc[start:end]].
|
||||
| The #[code match_id] is the ID of the added match pattern.
|
||||
|
||||
+h(2, "pipe") PhraseMatcher.pipe
|
||||
+tag method
|
||||
|
||||
p Match a stream of documents, yielding them in turn.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.matcher import PhraseMatcher
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
for doc in matcher.pipe(texts, batch_size=50, n_threads=4):
|
||||
pass
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code docs]
|
||||
+cell iterable
|
||||
+cell A stream of documents.
|
||||
|
||||
+row
|
||||
+cell #[code batch_size]
|
||||
+cell int
|
||||
+cell The number of documents to accumulate into a working set.
|
||||
|
||||
+row
|
||||
+cell #[code n_threads]
|
||||
+cell int
|
||||
+cell
|
||||
| The number of threads with which to work on the buffer in
|
||||
| parallel, if the #[code PhraseMatcher] implementation supports
|
||||
| multi-threading.
|
||||
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Doc]
|
||||
+cell Documents, in order.
|
||||
|
||||
+h(2, "len") PhraseMatcher.__len__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Get the number of rules added to the matcher. Note that this only returns
|
||||
| the number of rules (identical with the number of IDs), not the number
|
||||
| of individual patterns.
|
||||
|
||||
+aside-code("Example").
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
assert len(matcher) == 0
|
||||
matcher.add('OBAMA', None, nlp(u"Barack Obama"))
|
||||
assert len(matcher) == 1
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of rules.
|
||||
|
||||
+h(2, "contains") PhraseMatcher.__contains__
|
||||
+tag method
|
||||
|
||||
p Check whether the matcher contains rules for a match ID.
|
||||
|
||||
+aside-code("Example").
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
assert len(matcher) == 0
|
||||
matcher.add('OBAMA', None, nlp(u"Barack Obama"))
|
||||
assert len(matcher) == 1
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code key]
|
||||
+cell unicode
|
||||
+cell The match ID.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell Whether the matcher contains rules for this match ID.
|
||||
|
||||
+h(2, "add") PhraseMatcher.add
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Add a rule to the matcher, consisting of an ID key, one or more patterns, and
|
||||
| a callback function to act on the matches. The callback function will
|
||||
| receive the arguments #[code matcher], #[code doc], #[code i] and
|
||||
| #[code matches]. If a pattern already exists for the given ID, the
|
||||
| patterns will be extended. An #[code on_match] callback will be
|
||||
| overwritten.
|
||||
|
||||
+aside-code("Example").
|
||||
def on_match(matcher, doc, id, matches):
|
||||
print('Matched!', matches)
|
||||
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
matcher.add('OBAMA', on_match, nlp(u"Barack Obama"))
|
||||
matcher.add('HEALTH', on_match, nlp(u"health care reform"),
|
||||
nlp(u"healthcare reform"))
|
||||
doc = nlp(u"Barack Obama urges Congress to find courage to defend his healthcare reforms")
|
||||
matches = matcher(doc)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code match_id]
|
||||
+cell unicode
|
||||
+cell An ID for the thing you're matching.
|
||||
|
||||
+row
|
||||
+cell #[code on_match]
|
||||
+cell callable or #[code None]
|
||||
+cell
|
||||
| Callback function to act on matches. Takes the arguments
|
||||
| #[code matcher], #[code doc], #[code i] and #[code matches].
|
||||
|
||||
+row
|
||||
+cell #[code *docs]
|
||||
+cell list
|
||||
+cell
|
||||
| #[code Doc] objects of the phrases to match.
|
390
website/api/pipe.jade
Normal file
|
@ -0,0 +1,390 @@
|
|||
//- 💫 DOCS > API > PIPE
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
//- This page can be used as a template for all other classes that inherit
|
||||
//- from `Pipe`.
|
||||
|
||||
if subclass
|
||||
+infobox
|
||||
| This class is a subclass of #[+api("pipe") #[code Pipe]] and
|
||||
| follows the same API. The pipeline component is available in the
|
||||
| #[+a("/usage/processing-pipelines") processing pipeline] via the ID
|
||||
| #[code "#{pipeline_id}"].
|
||||
|
||||
else
|
||||
p
|
||||
| This class is not instantiated directly. Components inherit from it,
|
||||
| and it defines the interface that components should follow to
|
||||
| function as components in a spaCy analysis pipeline.
|
||||
|
||||
- CLASSNAME = subclass || 'Pipe'
|
||||
- VARNAME = short || CLASSNAME.toLowerCase()
|
||||
|
||||
|
||||
+h(2, "model") #{CLASSNAME}.Model
|
||||
+tag classmethod
|
||||
|
||||
p
|
||||
| Initialise a model for the pipe. The model should implement the
|
||||
| #[code thinc.neural.Model] API. Wrappers are available for
|
||||
| #[+a("/usage/deep-learning") most major machine learning libraries].
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code **kwargs]
|
||||
+cell -
|
||||
+cell Parameters for initialising the model
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell object
|
||||
+cell The initialised model.
|
||||
|
||||
+h(2, "init") #{CLASSNAME}.__init__
|
||||
+tag method
|
||||
|
||||
p Create a new pipeline instance.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.pipeline import #{CLASSNAME}
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code vocab]
|
||||
+cell #[code Vocab]
|
||||
+cell The shared vocabulary.
|
||||
|
||||
+row
|
||||
+cell #[code model]
|
||||
+cell #[code thinc.neural.Model] or #[code True]
|
||||
+cell
|
||||
| The model powering the pipeline component. If no model is
|
||||
| supplied, the model is created when you call
|
||||
| #[code begin_training], #[code from_disk] or #[code from_bytes].
|
||||
|
||||
+row
|
||||
+cell #[code **cfg]
|
||||
+cell -
|
||||
+cell Configuration parameters.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code=CLASSNAME]
|
||||
+cell The newly constructed object.
|
||||
|
||||
+h(2, "call") #{CLASSNAME}.__call__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Apply the pipe to one document. The document is modified in place, and
|
||||
| returned. Both #[code #{CLASSNAME}.__call__] and
|
||||
| #[code #{CLASSNAME}.pipe] should delegate to the
|
||||
| #[code #{CLASSNAME}.predict] and #[code #{CLASSNAME}.set_annotations]
|
||||
| methods.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
doc = nlp(u"This is a sentence.")
|
||||
processed = #{VARNAME}(doc)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The document to process.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Doc]
|
||||
+cell The processed document.
|
||||
|
||||
+h(2, "pipe") #{CLASSNAME}.pipe
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Apply the pipe to a stream of documents. Both
|
||||
| #[code #{CLASSNAME}.__call__] and #[code #{CLASSNAME}.pipe] should
|
||||
| delegate to the #[code #{CLASSNAME}.predict] and
|
||||
| #[code #{CLASSNAME}.set_annotations] methods.
|
||||
|
||||
+aside-code("Example").
|
||||
texts = [u'One doc', u'...', u'Lots of docs']
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
for doc in #{VARNAME}.pipe(texts, batch_size=50):
|
||||
pass
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code stream]
|
||||
+cell iterable
|
||||
+cell A stream of documents.
|
||||
|
||||
+row
|
||||
+cell #[code batch_size]
|
||||
+cell int
|
||||
+cell The number of texts to buffer. Defaults to #[code 128].
|
||||
|
||||
+row
|
||||
+cell #[code n_threads]
|
||||
+cell int
|
||||
+cell
|
||||
| The number of worker threads to use. If #[code -1], OpenMP will
|
||||
| decide how many to use at run time. Default is #[code -1].
|
||||
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Doc]
|
||||
+cell Processed documents in the order of the original text.
|
||||
|
||||
+h(2, "predict") #{CLASSNAME}.predict
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Apply the pipeline's model to a batch of docs, without modifying them.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
scores = #{VARNAME}.predict([doc1, doc2])
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code docs]
|
||||
+cell iterable
|
||||
+cell The documents to predict.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell -
|
||||
+cell Scores from the model.
|
||||
|
||||
+h(2, "set_annotations") #{CLASSNAME}.set_annotations
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Modify a batch of documents, using pre-computed scores.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
scores = #{VARNAME}.predict([doc1, doc2])
|
||||
#{VARNAME}.set_annotations([doc1, doc2], scores)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code docs]
|
||||
+cell iterable
|
||||
+cell The documents to modify.
|
||||
|
||||
+row
|
||||
+cell #[code scores]
|
||||
+cell -
|
||||
+cell The scores to set, produced by #[code #{CLASSNAME}.predict].
|
||||
|
||||
+h(2, "update") #{CLASSNAME}.update
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Learn from a batch of documents and gold-standard information, updating
|
||||
| the pipe's model. Delegates to #[code #{CLASSNAME}.predict] and
|
||||
| #[code #{CLASSNAME}.get_loss].
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
losses = {}
|
||||
optimizer = nlp.begin_training()
|
||||
#{VARNAME}.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code docs]
|
||||
+cell iterable
|
||||
+cell A batch of documents to learn from.
|
||||
|
||||
+row
|
||||
+cell #[code golds]
|
||||
+cell iterable
|
||||
+cell The gold-standard data. Must have the same length as #[code docs].
|
||||
|
||||
+row
|
||||
+cell #[code drop]
|
||||
+cell float
|
||||
+cell The dropout rate.
|
||||
|
||||
+row
|
||||
+cell #[code sgd]
|
||||
+cell callable
|
||||
+cell
|
||||
| The optimizer. Should take two arguments #[code weights] and
|
||||
| #[code gradient], and an optional ID.
|
||||
|
||||
+row
|
||||
+cell #[code losses]
|
||||
+cell dict
|
||||
+cell
|
||||
| Optional record of the loss during training. The value keyed by
|
||||
| the model's name is updated.
|
||||
|
||||
+h(2, "get_loss") #{CLASSNAME}.get_loss
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Find the loss and gradient of loss for the batch of documents and their
|
||||
| predicted scores.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
scores = #{VARNAME}.predict([doc1, doc2])
|
||||
loss, d_loss = #{VARNAME}.get_loss([doc1, doc2], [gold1, gold2], scores)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code docs]
|
||||
+cell iterable
|
||||
+cell The batch of documents.
|
||||
|
||||
+row
|
||||
+cell #[code golds]
|
||||
+cell iterable
|
||||
+cell The gold-standard data. Must have the same length as #[code docs].
|
||||
|
||||
+row
|
||||
+cell #[code scores]
|
||||
+cell -
|
||||
+cell Scores representing the model's predictions.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell tuple
|
||||
+cell The loss and the gradient, i.e. #[code (loss, gradient)].
|
||||
|
||||
+h(2, "begin_training") #{CLASSNAME}.begin_training
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Initialize the pipe for training, using data examples if available. If no
|
||||
| model has been initialized yet, the model is added.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
nlp.pipeline.append(#{VARNAME})
|
||||
#{VARNAME}.begin_training(pipeline=nlp.pipeline)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code gold_tuples]
|
||||
+cell iterable
|
||||
+cell
|
||||
| Optional gold-standard annotations from which to construct
|
||||
| #[+api("goldparse") #[code GoldParse]] objects.
|
||||
|
||||
+row
|
||||
+cell #[code pipeline]
|
||||
+cell list
|
||||
+cell
|
||||
| Optional list of #[+api("pipe") #[code Pipe]] components that
|
||||
| this component is part of.
|
||||
|
||||
+h(2, "use_params") #{CLASSNAME}.use_params
|
||||
+tag method
|
||||
+tag contextmanager
|
||||
|
||||
p Modify the pipe's model, to use the given parameter values.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
optimizer = nlp.begin_training()
with #{VARNAME}.use_params(optimizer.averages):
|
||||
#{VARNAME}.to_disk('/best_model')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code params]
|
||||
+cell -
|
||||
+cell
|
||||
| The parameter values to use in the model. At the end of the
|
||||
| context, the original parameters are restored.
|
||||
|
||||
+h(2, "to_disk") #{CLASSNAME}.to_disk
|
||||
+tag method
|
||||
|
||||
p Serialize the pipe to disk.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
#{VARNAME}.to_disk('/path/to/#{VARNAME}')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code path]
|
||||
+cell unicode or #[code Path]
|
||||
+cell
|
||||
| A path to a directory, which will be created if it doesn't exist.
|
||||
| Paths may be either strings or #[code Path]-like objects.
|
||||
|
||||
+h(2, "from_disk") #{CLASSNAME}.from_disk
|
||||
+tag method
|
||||
|
||||
p Load the pipe from disk. Modifies the object in place and returns it.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
#{VARNAME}.from_disk('/path/to/#{VARNAME}')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code path]
|
||||
+cell unicode or #[code Path]
|
||||
+cell
|
||||
| A path to a directory. Paths may be either strings or
|
||||
| #[code Path]-like objects.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code=CLASSNAME]
|
||||
+cell The modified #[code=CLASSNAME] object.
|
||||
|
||||
+h(2, "to_bytes") #{CLASSNAME}.to_bytes
|
||||
+tag method
|
||||
|
||||
+aside-code("example").
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
#{VARNAME}_bytes = #{VARNAME}.to_bytes()
|
||||
|
||||
p Serialize the pipe to a bytestring.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code **exclude]
|
||||
+cell -
|
||||
+cell Named attributes to prevent from being serialized.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bytes
|
||||
+cell The serialized form of the #[code=CLASSNAME] object.
|
||||
|
||||
+h(2, "from_bytes") #{CLASSNAME}.from_bytes
|
||||
+tag method
|
||||
|
||||
p Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||
|
||||
+aside-code("Example").
|
||||
#{VARNAME}_bytes = #{VARNAME}.to_bytes()
|
||||
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
|
||||
#{VARNAME}.from_bytes(#{VARNAME}_bytes)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code bytes_data]
|
||||
+cell bytes
|
||||
+cell The data to load from.
|
||||
|
||||
+row
|
||||
+cell #[code **exclude]
|
||||
+cell -
|
||||
+cell Named attributes to prevent from being loaded.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code=CLASSNAME]
|
||||
+cell The #[code=CLASSNAME] object.
|
|
@ -1,6 +1,6 @@
|
|||
//- 💫 DOCS > API > SPAN
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p A slice from a #[+api("doc") #[code Doc]] object.
|
||||
|
||||
|
@ -40,7 +40,7 @@ p Create a Span object from the #[code slice doc[start : end]].
|
|||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell A meaning representation of the span.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Span]
|
||||
+cell The newly constructed object.
|
||||
|
@ -61,7 +61,7 @@ p Get a #[code Token] object.
|
|||
+cell int
|
||||
+cell The index of the token within the span.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Token]
|
||||
+cell The token at #[code span[i]].
|
||||
|
@ -79,7 +79,7 @@ p Get a #[code Span] object.
|
|||
+cell tuple
|
||||
+cell The slice of the span to get.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Span]
|
||||
+cell The span at #[code span[start : end]].
|
||||
|
@ -95,7 +95,7 @@ p Iterate over #[code Token] objects.
|
|||
assert [t.text for t in span] == ['it', 'back', '!']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell A #[code Token] object.
|
||||
|
@ -111,7 +111,7 @@ p Get the number of tokens in the span.
|
|||
assert len(span) == 3
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of tokens in the span.
|
||||
|
@ -140,7 +140,7 @@ p
|
|||
| The object to compare with. By default, accepts #[code Doc],
|
||||
| #[code Span], #[code Token] and #[code Lexeme] objects.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell float
|
||||
+cell A scalar similarity score. Higher is more similar.
|
||||
|
@ -167,7 +167,7 @@ p
|
|||
+cell list
|
||||
+cell A list of attribute ID ints.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[long, ndim=2]]
|
||||
+cell
|
||||
|
@ -194,7 +194,7 @@ p Retokenize the document, such that the span is merged into a single token.
|
|||
| Attributes to assign to the merged token. By default, attributes
|
||||
| are inherited from the syntactic root token of the span.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Token]
|
||||
+cell The newly merged token.
|
||||
|
@ -216,7 +216,7 @@ p
|
|||
assert new_york.root.text == 'York'
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Token]
|
||||
+cell The root token.
|
||||
|
@ -233,7 +233,7 @@ p Tokens that are to the left of the span, whose head is within the span.
|
|||
assert lefts == [u'New']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell A left-child of a token of the span.
|
||||
|
@ -250,7 +250,7 @@ p Tokens that are to the right of the span, whose head is within the span.
|
|||
assert rights == [u'in']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell A right-child of a token of the span.
|
||||
|
@ -267,7 +267,7 @@ p Tokens that descend from tokens in the span, but fall outside it.
|
|||
assert subtree == [u'Give', u'it', u'back', u'!']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell A descendant of a token within the span.
|
||||
|
@ -285,7 +285,7 @@ p
|
|||
assert doc[1:].has_vector
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the span has a vector data attached.
|
||||
|
@ -304,7 +304,7 @@ p
|
|||
assert doc[1:].vector.shape == (300,)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell A 1D numpy array representing the span's semantics.
|
||||
|
@ -323,7 +323,7 @@ p
|
|||
assert doc[1:].vector_norm != doc[2:].vector_norm
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell float
|
||||
+cell The L2 norm of the vector representation.
|
|
@ -1,6 +1,6 @@
|
|||
//- 💫 DOCS > API > STRINGSTORE
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| Look up strings by 64-bit hashes. As of v2.0, spaCy uses hash values
|
||||
|
@ -23,7 +23,7 @@ p
|
|||
+cell iterable
|
||||
+cell A sequence of unicode strings to add to the store.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code StringStore]
|
||||
+cell The newly constructed object.
|
||||
|
@ -38,7 +38,7 @@ p Get the number of strings in the store.
|
|||
assert len(stringstore) == 2
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of strings in the store.
|
||||
|
@ -60,7 +60,7 @@ p Retrieve a string from a given hash, or vice versa.
|
|||
+cell bytes, unicode or uint64
|
||||
+cell The value to encode.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell unicode or int
|
||||
+cell The value to be retrieved.
|
||||
|
@ -81,7 +81,7 @@ p Check whether a string is in the store.
|
|||
+cell unicode
|
||||
+cell The string to check.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the store contains the string.
|
||||
|
@ -100,7 +100,7 @@ p
|
|||
assert all_strings == [u'apple', u'orange']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell unicode
|
||||
+cell A string in the store.
|
||||
|
@ -125,7 +125,7 @@ p Add a string to the #[code StringStore].
|
|||
+cell unicode
|
||||
+cell The string to add.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell uint64
|
||||
+cell The string's hash value.
|
||||
|
@ -166,7 +166,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
|
|||
| A path to a directory. Paths may be either strings or
|
||||
| #[code Path]-like objects.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code StringStore]
|
||||
+cell The modified #[code StringStore] object.
|
||||
|
@ -185,7 +185,7 @@ p Serialize the current state to a binary string.
|
|||
+cell -
|
||||
+cell Named attributes to prevent from being serialized.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bytes
|
||||
+cell The serialized form of the #[code StringStore] object.
|
||||
|
@ -211,7 +211,7 @@ p Load state from a binary string.
|
|||
+cell -
|
||||
+cell Named attributes to prevent from being loaded.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code StringStore]
|
||||
+cell The #[code StringStore] object.
|
||||
|
@ -233,7 +233,7 @@ p Get a 64-bit hash for a given string.
|
|||
+cell unicode
|
||||
+cell The string to hash.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell uint64
|
||||
+cell The hash.
|
5
website/api/tagger.jade
Normal file
|
@ -0,0 +1,5 @@
|
|||
//- 💫 DOCS > API > TAGGER
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
!=partial("pipe", { subclass: "Tagger", pipeline_id: "tagger" })
|
5
website/api/tensorizer.jade
Normal file
|
@ -0,0 +1,5 @@
|
|||
//- 💫 DOCS > API > TENSORIZER
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
!=partial("pipe", { subclass: "Tensorizer", pipeline_id: "tensorizer" })
|
19
website/api/textcategorizer.jade
Normal file
|
@ -0,0 +1,19 @@
|
|||
//- 💫 DOCS > API > TEXTCATEGORIZER
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| The model supports classification with multiple, non-mutually exclusive
|
||||
| labels. You can change the model architecture rather easily, but by
|
||||
| default, the #[code TextCategorizer] class uses a convolutional
|
||||
| neural network to assign position-sensitive vectors to each word in the
|
||||
| document. This step is similar to the #[+api("tensorizer") #[code Tensorizer]]
|
||||
| component, but the #[code TextCategorizer] uses its own CNN model, to
|
||||
| avoid sharing weights with the other pipeline components. The document
|
||||
| tensor is then
|
||||
| summarized by concatenating max and mean pooling, and a multilayer
|
||||
| perceptron is used to predict an output vector of length #[code nr_class],
|
||||
| before a logistic activation is applied elementwise. The value of each
|
||||
| output neuron is the probability that some class is present.
|
||||
|
||||
!=partial("pipe", { subclass: "TextCategorizer", short: "textcat", pipeline_id: "textcat" })
|
|
@ -1,6 +1,6 @@
|
|||
//- 💫 DOCS > API > TOKEN
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p An individual token — i.e. a word, punctuation symbol, whitespace, etc.
|
||||
|
||||
|
@ -30,7 +30,7 @@ p Construct a #[code Token] object.
|
|||
+cell int
|
||||
+cell The index of the token within the document.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Token]
|
||||
+cell The newly constructed object.
|
||||
|
@ -46,7 +46,7 @@ p The number of unicode characters in the token, i.e. #[code token.text].
|
|||
assert len(token) == 4
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of unicode characters in the token.
|
||||
|
@ -68,7 +68,7 @@ p Check the value of a boolean flag.
|
|||
+cell int
|
||||
+cell The attribute ID of the flag to check.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the flag is set.
|
||||
|
@ -93,7 +93,7 @@ p Compute a semantic similarity estimate. Defaults to cosine over vectors.
|
|||
| The object to compare with. By default, accepts #[code Doc],
|
||||
| #[code Span], #[code Token] and #[code Lexeme] objects.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell float
|
||||
+cell A scalar similarity score. Higher is more similar.
|
||||
|
@ -114,7 +114,7 @@ p Get a neighboring token.
|
|||
+cell int
|
||||
+cell The relative position of the token to get. Defaults to #[code 1].
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Token]
|
||||
+cell The token at position #[code self.doc[self.i+i]].
|
||||
|
@ -139,7 +139,7 @@ p
|
|||
+cell #[code Token]
|
||||
+cell Another token.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether this token is the ancestor of the descendant.
|
||||
|
@ -158,7 +158,7 @@ p The rightmost token of this token's syntactic descendants.
|
|||
assert [t.text for t in he_ancestors] == [u'pleaded']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell
|
||||
|
@ -177,7 +177,7 @@ p A sequence of coordinated tokens, including the token itself.
|
|||
assert [t.text for t in apples_conjuncts] == [u'oranges']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell A coordinated token.
|
||||
|
@ -194,7 +194,7 @@ p A sequence of the token's immediate syntactic children.
|
|||
assert [t.text for t in give_children] == [u'it', u'back', u'!']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell A child token such that #[code child.head==self].
|
||||
|
@ -211,7 +211,7 @@ p A sequence of all the token's syntactic descendants.
|
|||
assert [t.text for t in give_subtree] == [u'Give', u'it', u'back', u'!']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Token]
|
||||
+cell A descendant token such that #[code self.is_ancestor(descendant)].
|
||||
|
@ -230,7 +230,7 @@ p
|
|||
assert apples.has_vector
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the token has a vector data attached.
|
||||
|
@ -248,7 +248,7 @@ p A real-valued meaning representation.
|
|||
assert apples.vector.shape == (300,)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell A 1D numpy array representing the token's semantics.
|
||||
|
@ -268,7 +268,7 @@ p The L2 norm of the token's vector representation.
|
|||
assert apples.vector_norm != pasta.vector_norm
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell float
|
||||
+cell The L2 norm of the vector representation.
|
||||
|
@ -280,20 +280,29 @@ p The L2 norm of the token's vector representation.
|
|||
+cell #[code text]
|
||||
+cell unicode
|
||||
+cell Verbatim text content.
|
||||
|
||||
+row
|
||||
+cell #[code text_with_ws]
|
||||
+cell unicode
|
||||
+cell Text content, with trailing space character if present.
|
||||
|
||||
+row
|
||||
+cell #[code whitespace]
|
||||
+cell int
|
||||
+cell Trailing space character if present.
|
||||
+row
|
||||
+cell #[code whitespace_]
|
||||
+cell unicode
|
||||
+cell Trailing space character if present.
|
||||
|
||||
+row
|
||||
+cell #[code orth]
|
||||
+cell int
|
||||
+cell ID of the verbatim text content.
|
||||
|
||||
+row
|
||||
+cell #[code orth_]
|
||||
+cell unicode
|
||||
+cell
|
||||
| Verbatim text content (identical to #[code Token.text]). Exists
|
||||
| mostly for consistency with the other attributes.
|
||||
|
||||
+row
|
||||
+cell #[code vocab]
|
||||
+cell #[code Vocab]
|
|
@ -1,6 +1,6 @@
|
|||
//- 💫 DOCS > API > TOKENIZER
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| Segment text, and create #[code Doc] objects with the discovered segment
|
||||
|
@ -57,7 +57,7 @@ p Create a #[code Tokenizer], to create #[code Doc] objects given unicode text.
|
|||
+cell callable
|
||||
+cell A boolean function matching strings to be recognised as tokens.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Tokenizer]
|
||||
+cell The newly constructed object.
|
||||
|
@ -77,7 +77,7 @@ p Tokenize a string.
|
|||
+cell unicode
|
||||
+cell The string to tokenize.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Doc]
|
||||
+cell A container for linguistic annotations.
|
||||
|
@ -110,7 +110,7 @@ p Tokenize a stream of texts.
|
|||
| The number of threads to use, if the implementation supports
|
||||
| multi-threading. The default tokenizer is single-threaded.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Doc]
|
||||
+cell A sequence of Doc objects, in order.
|
||||
|
@ -126,7 +126,7 @@ p Find internal split points of the string.
|
|||
+cell unicode
|
||||
+cell The string to split.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell list
|
||||
+cell
|
||||
|
@ -147,7 +147,7 @@ p
|
|||
+cell unicode
|
||||
+cell The string to segment.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The length of the prefix if present, otherwise #[code None].
|
||||
|
@ -165,7 +165,7 @@ p
|
|||
+cell unicode
|
||||
+cell The string to segment.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int / #[code None]
|
||||
+cell The length of the suffix if present, otherwise #[code None].
|
||||
|
@ -176,7 +176,7 @@ p
|
|||
p
|
||||
| Add a special-case tokenization rule. This mechanism is also used to add
|
||||
| custom tokenizer exceptions to the language data. See the usage guide
|
||||
| on #[+a("/docs/usage/adding-languages#tokenizer-exceptions") adding languages]
|
||||
| on #[+a("/usage/adding-languages#tokenizer-exceptions") adding languages]
|
||||
| for more details and examples.
|
||||
|
||||
+aside-code("Example").
|
24
website/api/top-level.jade
Normal file
|
@ -0,0 +1,24 @@
|
|||
//- 💫 DOCS > API > TOP-LEVEL
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
+section("spacy")
|
||||
//-+h(2, "spacy") spaCy
|
||||
//- spacy/__init__.py
|
||||
include _top-level/_spacy
|
||||
|
||||
+section("displacy")
|
||||
+h(2, "displacy", "spacy/displacy") displaCy
|
||||
include _top-level/_displacy
|
||||
|
||||
+section("util")
|
||||
+h(2, "util", "spacy/util.py") Utility functions
|
||||
include _top-level/_util
|
||||
|
||||
+section("compat")
|
||||
+h(2, "compat", "spacy/compaty.py") Compatibility functions
|
||||
include _top-level/_compat
|
||||
|
||||
+section("cli", "spacy/cli")
|
||||
+h(2, "cli") Command line
|
||||
include _top-level/_cli
|
333
website/api/vectors.jade
Normal file
|
@ -0,0 +1,333 @@
|
|||
//- 💫 DOCS > API > VECTORS
|
||||
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| Vectors data is kept in the #[code Vectors.data] attribute, which should
|
||||
| be an instance of #[code numpy.ndarray] (for CPU vectors) or
|
||||
| #[code cupy.ndarray] (for GPU vectors).
|
||||
|
||||
+h(2, "init") Vectors.__init__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Create a new vector store. To keep the vector table empty, pass
|
||||
| #[code data_or_width=0]. You can also create the vector table and add
|
||||
| vectors one by one, or set the vector values directly on initialisation.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.vectors import Vectors
|
||||
from spacy.strings import StringStore
|
||||
|
||||
empty_vectors = Vectors(StringStore())
|
||||
|
||||
vectors = Vectors([u'cat'], 300)
|
||||
vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
|
||||
|
||||
vector_table = numpy.zeros((3, 300), dtype='f')
|
||||
vectors = Vectors(StringStore(), vector_table)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code strings]
|
||||
+cell #[code StringStore] or list
|
||||
+cell
|
||||
| List of strings, or a #[+api("stringstore") #[code StringStore]]
|
||||
| that maps strings to hash values, and vice versa.
|
||||
|
||||
+row
|
||||
+cell #[code data_or_width]
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']] or int
|
||||
+cell Vector data or number of dimensions.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Vectors]
|
||||
+cell The newly created object.
|
||||
|
||||
+h(2, "getitem") Vectors.__getitem__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Get a vector by key. If key is a string, it is hashed to an integer ID
|
||||
| using the #[code Vectors.strings] table. If the integer key is not found
|
||||
| in the table, a #[code KeyError] is raised.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
cat_vector = vectors[u'cat']
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code key]
|
||||
+cell unicode / int
|
||||
+cell The key to get the vector for.
|
||||
|
||||
+row
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell The vector for the key.
|
||||
|
||||
+h(2, "setitem") Vectors.__setitem__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Set a vector for the given key. If key is a string, it is hashed to an
|
||||
| integer ID using the #[code Vectors.strings] table.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code key]
|
||||
+cell unicode / int
|
||||
+cell The key to set the vector for.
|
||||
|
||||
+row
|
||||
+cell #[code vector]
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell The vector to set.
|
||||
|
||||
+h(2, "iter") Vectors.__iter__
|
||||
+tag method
|
||||
|
||||
p Yield vectors from the table.
|
||||
|
||||
+aside-code("Example").
|
||||
vector_table = numpy.zeros((3, 300), dtype='f')
|
||||
vectors = Vectors(StringStore(), vector_table)
|
||||
for vector in vectors:
|
||||
print(vector)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell A vector from the table.
|
||||
|
||||
+h(2, "len") Vectors.__len__
|
||||
+tag method
|
||||
|
||||
p Return the number of vectors that have been assigned.
|
||||
|
||||
+aside-code("Example").
|
||||
vector_table = numpy.zeros((3, 300), dtype='f')
|
||||
vectors = Vectors(StringStore(), vector_table)
|
||||
assert len(vectors) == 3
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of vectors in the data.
|
||||
|
||||
+h(2, "contains") Vectors.__contains__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Check whether a key has a vector entry in the table. If key is a string,
|
||||
| it is hashed to an integer ID using the #[code Vectors.strings] table.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
assert u'cat' in vectors
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code key]
|
||||
+cell unicode / int
|
||||
+cell The key to check.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the key has a vector entry.
|
||||
|
||||
+h(2, "add") Vectors.add
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Add a key to the table, optionally setting a vector value as well. If
|
||||
| key is a string, it is hashed to an integer ID using the
|
||||
| #[code Vectors.strings] table.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code key]
|
||||
+cell unicode / int
|
||||
+cell The key to add.
|
||||
|
||||
+row
|
||||
+cell #[code vector]
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell An optional vector to add.
|
||||
|
||||
+h(2, "items") Vectors.items
|
||||
+tag method
|
||||
|
||||
p Iterate over #[code (string key, vector)] pairs, in order.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
for key, vector in vectors.items():
|
||||
print(key, vector)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell tuple
|
||||
+cell #[code (string key, vector)] pairs, in order.
|
||||
|
||||
+h(2, "shape") Vectors.shape
|
||||
+tag property
|
||||
|
||||
p
|
||||
| Get the #[code (rows, dims)] tuple of the number of rows and number of
|
||||
| dimensions in the vector table.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
rows, dims = vectors.shape
|
||||
assert rows == 1
|
||||
assert dims == 300
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell tuple
|
||||
+cell The #[code (rows, dims)] tuple.
|
||||
|
||||
+h(2, "from_glove") Vectors.from_glove
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Load #[+a("https://nlp.stanford.edu/projects/glove/") GloVe] vectors from
|
||||
| a directory. Assumes binary format, that the vocab is in a
|
||||
| #[code vocab.txt], and that vectors are named
|
||||
| #[code vectors.{size}.[fd].bin], e.g. #[code vectors.128.f.bin] for 128d
|
||||
| float32 vectors, #[code vectors.300.d.bin] for 300d float64 (double)
|
||||
| vectors, etc. By default GloVe outputs 64-bit vectors.
|
||||
|
||||
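p
| A minimal, hypothetical sketch of loading such a directory. The path and
| file names below are assumptions based on the description above.

+aside-code("Example").
from spacy.vectors import Vectors
from spacy.strings import StringStore

# assumes /path/to/glove contains vocab.txt and e.g. vectors.300.f.bin
vectors = Vectors(StringStore())
vectors.from_glove('/path/to/glove')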
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code path]
|
||||
+cell unicode / #[code Path]
|
||||
+cell The path to load the GloVe vectors from.
|
||||
|
||||
+h(2, "to_disk") Vectors.to_disk
|
||||
+tag method
|
||||
|
||||
p Save the current state to a directory.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors.to_disk('/path/to/vectors')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code path]
|
||||
+cell unicode or #[code Path]
|
||||
+cell
|
||||
| A path to a directory, which will be created if it doesn't exist.
|
||||
| Paths may be either strings or #[code Path]-like objects.
|
||||
|
||||
+h(2, "from_disk") Vectors.from_disk
|
||||
+tag method
|
||||
|
||||
p Load state from a directory. Modifies the object in place and returns it.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore())
|
||||
vectors.from_disk('/path/to/vectors')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code path]
|
||||
+cell unicode or #[code Path]
|
||||
+cell
|
||||
| A path to a directory. Paths may be either strings or
|
||||
| #[code Path]-like objects.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Vectors]
|
||||
+cell The modified #[code Vectors] object.
|
||||
|
||||
+h(2, "to_bytes") Vectors.to_bytes
|
||||
+tag method
|
||||
|
||||
p Serialize the current state to a binary string.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors_bytes = vectors.to_bytes()
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code **exclude]
|
||||
+cell -
|
||||
+cell Named attributes to prevent from being serialized.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bytes
|
||||
+cell The serialized form of the #[code Vectors] object.
|
||||
|
||||
+h(2, "from_bytes") Vectors.from_bytes
|
||||
+tag method
|
||||
|
||||
p Load state from a binary string.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.vectors import Vectors
|
||||
vectors_bytes = vectors.to_bytes()
|
||||
new_vectors = Vectors(StringStore())
|
||||
new_vectors.from_bytes(vectors_bytes)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code bytes_data]
|
||||
+cell bytes
|
||||
+cell The data to load from.
|
||||
|
||||
+row
|
||||
+cell #[code **exclude]
|
||||
+cell -
|
||||
+cell Named attributes to prevent from being loaded.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Vectors]
|
||||
+cell The #[code Vectors] object.
|
||||
|
||||
+h(2, "attributes") Attributes
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code data]
|
||||
+cell #[code numpy.ndarray] / #[code cupy.ndarray]
|
||||
+cell
|
||||
| Stored vectors data. #[code numpy] is used for CPU vectors,
|
||||
| #[code cupy] for GPU vectors.
|
||||
|
||||
+row
|
||||
+cell #[code key2row]
|
||||
+cell dict
|
||||
+cell
|
||||
| Dictionary mapping word hashes to rows in the
|
||||
| #[code Vectors.data] table.
|
||||
|
||||
+row
|
||||
+cell #[code keys]
|
||||
+cell #[code numpy.ndarray]
|
||||
+cell
|
||||
| Array keeping the keys in order, such that
|
||||
| #[code keys[vectors.key2row[key]] == key]
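p
| A brief sketch of how these attributes fit together, assuming a vector has
| been added for the hypothetical key #[code u'cat']:

+aside-code("Example").
key = vectors.strings[u'cat']  # hash value of the string
row = vectors.key2row[key]     # row index into the data table
assert numpy.array_equal(vectors.data[row], vectors[u'cat'])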
|
|
@ -1,17 +1,22 @@
|
|||
//- 💫 DOCS > API > VOCAB
|
||||
|
||||
include ../../_includes/_mixins
|
||||
include ../_includes/_mixins
|
||||
|
||||
p
|
||||
| A lookup table that allows you to access #[code Lexeme] objects. The
|
||||
| #[code Vocab] instance also provides access to the #[code StringStore],
|
||||
| and owns underlying C-data that is shared between #[code Doc] objects.
|
||||
| The #[code Vocab] object provides a lookup table that allows you to
|
||||
| access #[+api("lexeme") #[code Lexeme]] objects, as well as the
|
||||
| #[+api("stringstore") #[code StringStore]]. It also owns underlying
|
||||
| C-data that is shared between #[code Doc] objects.
|
||||
|
||||
+h(2, "init") Vocab.__init__
|
||||
+tag method
|
||||
|
||||
p Create the vocabulary.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.vocab import Vocab
|
||||
vocab = Vocab(strings=[u'hello', u'world'])
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code lex_attr_getters]
|
||||
|
@ -39,7 +44,7 @@ p Create the vocabulary.
|
|||
| A #[+api("stringstore") #[code StringStore]] that maps
|
||||
| strings to hash values, and vice versa, or a list of strings.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Vocab]
|
||||
+cell The newly constructed object.
|
||||
|
@ -54,7 +59,7 @@ p Get the current number of lexemes in the vocabulary.
|
|||
assert len(nlp.vocab) > 0
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of lexemes in the vocabulary.
|
||||
|
@ -76,7 +81,7 @@ p
|
|||
+cell int / unicode
|
||||
+cell The hash value of a word, or its unicode string.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Lexeme]
|
||||
+cell The lexeme indicated by the given ID.
|
||||
|
@ -90,7 +95,7 @@ p Iterate over the lexemes in the vocabulary.
|
|||
stop_words = (lex for lex in nlp.vocab if lex.is_stop)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code Lexeme]
|
||||
+cell An entry in the vocabulary.
|
||||
|
@ -115,7 +120,7 @@ p
|
|||
+cell unicode
|
||||
+cell The ID string.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the string has an entry in the vocabulary.
|
||||
|
@ -152,11 +157,100 @@ p
|
|||
| which the flag will be stored. If #[code -1], the lowest
|
||||
| available bit will be chosen.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The integer ID by which the flag value can be checked.
|
||||
|
||||
+h(2, "add_flag") Vocab.clear_vectors
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Drop the current vector table. Because all vectors must be the same
|
||||
| width, you have to call this to change the size of the vectors.
|
||||
|
||||
+aside-code("Example").
|
||||
nlp.vocab.clear_vectors(new_dim=300)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code new_dim]
|
||||
+cell int
|
||||
+cell
|
||||
| Number of dimensions of the new vectors. If #[code None], size
|
||||
| is not changed.
|
||||
|
||||
+h(2, "add_flag") Vocab.get_vector
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Retrieve a vector for a word in the vocabulary. Words can be looked up
|
||||
| by string or hash value. If no vectors data is loaded, a
|
||||
| #[code ValueError] is raised.
|
||||
|
||||
+aside-code("Example").
|
||||
nlp.vocab.get_vector(u'apple')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code orth]
|
||||
+cell int / unicode
|
||||
+cell The hash value of a word, or its unicode string.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell
|
||||
| A word vector. Size and shape are determined by the
|
||||
| #[code Vocab.vectors] instance.
|
||||
|
||||
+h(2, "add_flag") Vocab.set_vector
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Set a vector for a word in the vocabulary. Words can be referenced by
|
||||
| string or hash value.
|
||||
|
||||
+aside-code("Example").
|
||||
nlp.vocab.set_vector(u'apple', array([...]))
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code orth]
|
||||
+cell int / unicode
|
||||
+cell The hash value of a word, or its unicode string.
|
||||
|
||||
+row
|
||||
+cell #[code vector]
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell The vector to set.
|
||||
|
||||
+h(2, "add_flag") Vocab.has_vector
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Check whether a word has a vector. Returns #[code False] if no vectors
|
||||
| are loaded. Words can be looked up by string or hash value.
|
||||
|
||||
+aside-code("Example").
|
||||
if nlp.vocab.has_vector(u'apple'):
|
||||
vector = nlp.vocab.get_vector(u'apple')
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code orth]
|
||||
+cell int / unicode
|
||||
+cell The hash value of a word, or its unicode string.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the word has a vector.
|
||||
|
||||
+h(2, "to_disk") Vocab.to_disk
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
@ -192,7 +286,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
|
|||
| A path to a directory. Paths may be either strings or
|
||||
| #[code Path]-like objects.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Vocab]
|
||||
+cell The modified #[code Vocab] object.
|
||||
|
@ -211,7 +305,7 @@ p Serialize the current state to a binary string.
|
|||
+cell -
|
||||
+cell Named attributes to prevent from being serialized.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bytes
|
||||
+cell The serialized form of the #[code Vocab] object.
|
||||
|
@ -238,7 +332,7 @@ p Load state from a binary string.
|
|||
+cell -
|
||||
+cell Named attributes to prevent from being loaded.
|
||||
|
||||
+footrow
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Vocab]
|
||||
+cell The #[code Vocab] object.
|
||||
|
@ -256,3 +350,14 @@ p Load state from a binary string.
|
|||
+cell #[code strings]
|
||||
+cell #[code StringStore]
|
||||
+cell A table managing the string-to-int mapping.
|
||||
|
||||
+row
|
||||
+cell #[code vectors]
|
||||
+tag-new(2)
|
||||
+cell #[code Vectors]
|
||||
+cell A table associating word IDs with word vectors.
|
||||
|
||||
+row
|
||||
+cell #[code vectors_length]
|
||||
+cell int
|
||||
+cell Number of dimensions for each word vector.
|
|
@ -1,156 +0,0 @@
|
|||
//- 💫 DOCS > API > ANNOTATION SPECS
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p This document describes the target annotations spaCy is trained to predict.
|
||||
|
||||
+h(2, "tokenization") Tokenization
|
||||
|
||||
p
|
||||
| Tokenization standards are based on the
|
||||
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
|
||||
| The tokenizer differs from most by including tokens for significant
|
||||
| whitespace. Any sequence of whitespace characters beyond a single space
|
||||
| (#[code ' ']) is included as a token.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.lang.en import English
|
||||
nlp = English()
|
||||
tokens = nlp('Some\nspaces and\ttab characters')
|
||||
tokens_text = [t.text for t in tokens]
|
||||
assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and',
|
||||
'\t', 'tab', 'characters']
|
||||
|
||||
p
|
||||
| The whitespace tokens are useful for much the same reason punctuation is
|
||||
| – it's often an important delimiter in the text. By preserving it in the
|
||||
| token output, we are able to maintain a simple alignment between the
|
||||
| tokens and the original string, and we ensure that no information is
|
||||
| lost during processing.
|
||||
|
||||
+h(2, "sentence-boundary") Sentence boundary detection
|
||||
|
||||
p
|
||||
| Sentence boundaries are calculated from the syntactic parse tree, so
|
||||
| features such as punctuation and capitalisation play an important but
|
||||
| non-decisive role in determining the sentence boundaries. Usually this
|
||||
| means that the sentence boundaries will at least coincide with clause
|
||||
| boundaries, even given poorly punctuated text.
|
||||
|
||||
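p
| For example, assuming a model with a syntactic parser is loaded, the
| resulting boundaries are exposed via #[code Doc.sents] (a brief sketch):

+aside-code("Example").
doc = nlp(u'This is a sentence. This is another one.')
sents = list(doc.sents)
assert len(sents) == 2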
+h(2, "pos-tagging") Part-of-speech Tagging
|
||||
|
||||
+aside("Tip: Understanding tags")
|
||||
| You can also use #[code spacy.explain()] to get the description for the
|
||||
| string representation of a tag. For example,
|
||||
| #[code spacy.explain("RB")] will return "adverb".
|
||||
|
||||
include _annotation/_pos-tags
|
||||
|
||||
+h(2, "lemmatization") Lemmatization
|
||||
|
||||
p A "lemma" is the uninflected form of a word. In English, this means:
|
||||
|
||||
+list
|
||||
+item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
|
||||
+item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
|
||||
+item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
|
||||
+item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
|
||||
|
||||
p
|
||||
| The lemmatization data is taken from
|
||||
| #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
|
||||
| special case for pronouns: all pronouns are lemmatized to the special
|
||||
| token #[code -PRON-].
|
||||
|
||||
+infobox("About spaCy's custom pronoun lemma")
|
||||
| Unlike verbs and common nouns, there's no clear base form of a personal
|
||||
| pronoun. Should the lemma of "me" be "I", or should we normalize person
|
||||
| as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
|
||||
| novel symbol, #[code -PRON-], which is used as the lemma for
|
||||
| all personal pronouns.
|
||||
|
||||
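p
| A brief sketch, assuming an English model is loaded and assigns the
| lemmas described above:

+aside-code("Example").
doc = nlp(u'I was reading the papers')
assert doc[0].lemma_ == u'-PRON-'  # pronoun lemma
assert doc[2].lemma_ == u'read'    # verb base form
assert doc[4].lemma_ == u'paper'   # singular noun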
+h(2, "dependency-parsing") Syntactic Dependency Parsing
|
||||
|
||||
+aside("Tip: Understanding labels")
|
||||
| You can also use #[code spacy.explain()] to get the description for the
|
||||
| string representation of a label. For example,
|
||||
| #[code spacy.explain("prt")] will return "particle".
|
||||
|
||||
include _annotation/_dep-labels
|
||||
|
||||
+h(2, "named-entities") Named Entity Recognition
|
||||
|
||||
+aside("Tip: Understanding entity types")
|
||||
| You can also use #[code spacy.explain()] to get the description for the
|
||||
| string representation of an entity label. For example,
|
||||
| #[code spacy.explain("LANGUAGE")] will return "any named language".
|
||||
|
||||
include _annotation/_named-entities
|
||||
|
||||
+h(3, "biluo") BILUO Scheme
|
||||
|
||||
p
|
||||
| spaCy translates character offsets into the BILUO scheme, in order to
|
||||
| decide the cost of each action given the current state of the entity
|
||||
| recognizer. The costs are then used to calculate the gradient of the
|
||||
| loss, to train the model.
|
||||
|
||||
+aside("Why BILUO, not IOB?")
|
||||
| There are several coding schemes for encoding entity annotations as
|
||||
| token tags. These coding schemes are equally expressive, but not
|
||||
| necessarily equally learnable.
|
||||
| #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
|
||||
| showed that the minimal #[strong Begin], #[strong In], #[strong Out]
|
||||
| scheme was more difficult to learn than the #[strong BILUO] scheme that
|
||||
| we use, which explicitly marks boundary tokens.
|
||||
|
||||
+table([ "Tag", "Description" ])
|
||||
+row
|
||||
+cell #[code #[span.u-color-theme B] EGIN]
|
||||
+cell The first token of a multi-token entity.
|
||||
|
||||
+row
|
||||
+cell #[code #[span.u-color-theme I] N]
|
||||
+cell An inner token of a multi-token entity.
|
||||
|
||||
+row
|
||||
+cell #[code #[span.u-color-theme L] AST]
|
||||
+cell The final token of a multi-token entity.
|
||||
|
||||
+row
|
||||
+cell #[code #[span.u-color-theme U] NIT]
|
||||
+cell A single-token entity.
|
||||
|
||||
+row
|
||||
+cell #[code #[span.u-color-theme O] UT]
|
||||
+cell A non-entity token.
|
||||
|
||||
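p
| For example, a hypothetical annotation with one multi-token entity maps
| from character offsets to per-token BILUO tags like this:

+aside-code("Example").
text = u'San Francisco considers banning sidewalk robots'
entities = [(0, 13, u'GPE')]  # character offsets for "San Francisco"
# tokens: San | Francisco | considers | banning | sidewalk | robots
biluo_tags = [u'B-GPE', u'L-GPE', u'O', u'O', u'O', u'O']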
+h(2, "json-input") JSON input format for training
|
||||
|
||||
p
|
||||
| spaCy takes training data in the following format:
|
||||
|
||||
+code("Example structure").
|
||||
doc: {
|
||||
id: string,
|
||||
paragraphs: [{
|
||||
raw: string,
|
||||
sents: [int],
|
||||
tokens: [{
|
||||
start: int,
|
||||
tag: string,
|
||||
head: int,
|
||||
dep: string
|
||||
}],
|
||||
ner: [{
|
||||
start: int,
|
||||
end: int,
|
||||
label: string
|
||||
}],
|
||||
brackets: [{
|
||||
start: int,
|
||||
end: int,
|
||||
label: string
|
||||
}]
|
||||
}]
|
||||
}
|
|
@ -1,111 +0,0 @@
|
|||
//- 💫 DOCS > API > DEPENDENCYPARSER
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p Annotate syntactic dependencies on #[code Doc] objects.
|
||||
|
||||
+under-construction
|
||||
|
||||
+h(2, "init") DependencyParser.__init__
|
||||
+tag method
|
||||
|
||||
p Create a #[code DependencyParser].
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code vocab]
|
||||
+cell #[code Vocab]
|
||||
+cell The vocabulary. Must be shared with documents to be processed.
|
||||
|
||||
+row
|
||||
+cell #[code model]
|
||||
+cell #[code thinc.linear.AveragedPerceptron]
|
||||
+cell The statistical model.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell #[code DependencyParser]
|
||||
+cell The newly constructed object.
|
||||
|
||||
+h(2, "call") DependencyParser.__call__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Apply the dependency parser, setting the heads and dependency relations
|
||||
| onto the #[code Doc] object.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The document to be processed.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell #[code None]
|
||||
+cell -
|
||||
|
||||
+h(2, "pipe") DependencyParser.pipe
|
||||
+tag method
|
||||
|
||||
p Process a stream of documents.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code stream]
|
||||
+cell -
|
||||
+cell The sequence of documents to process.
|
||||
|
||||
+row
|
||||
+cell #[code batch_size]
|
||||
+cell int
|
||||
+cell The number of documents to accumulate into a working set.
|
||||
|
||||
+row
|
||||
+cell #[code n_threads]
|
||||
+cell int
|
||||
+cell
|
||||
| The number of threads with which to work on the buffer in
|
||||
| parallel.
|
||||
|
||||
+footrow
|
||||
+cell yields
|
||||
+cell #[code Doc]
|
||||
+cell Documents, in order.
|
||||
|
||||
+h(2, "update") DependencyParser.update
|
||||
+tag method
|
||||
|
||||
p Update the statistical model.
|
||||
|
||||
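p
| A hypothetical sketch of a single update step, assuming a
| #[code DependencyParser] instance and a three-token #[code Doc] with
| gold heads and labels:

+aside-code("Example").
from spacy.gold import GoldParse
gold = GoldParse(doc, heads=[1, 1, 1], deps=['nsubj', 'ROOT', 'dobj'])
loss = parser.update(doc, gold)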
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The example document for the update.
|
||||
|
||||
+row
|
||||
+cell #[code gold]
|
||||
+cell #[code GoldParse]
|
||||
+cell The gold-standard annotations, to calculate the loss.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The loss on this example.
|
||||
|
||||
+h(2, "step_through") DependencyParser.step_through
|
||||
+tag method
|
||||
|
||||
p Set up a stepwise state, to introspect and control the transition sequence.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The document to step through.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell #[code StepwiseState]
|
||||
+cell A state object, to step through the annotation process.
|
|
@ -1,109 +0,0 @@
|
|||
//- 💫 DOCS > API > ENTITYRECOGNIZER
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p Annotate named entities on #[code Doc] objects.
|
||||
|
||||
+under-construction
|
||||
|
||||
+h(2, "init") EntityRecognizer.__init__
|
||||
+tag method
|
||||
|
||||
p Create an #[code EntityRecognizer].
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code vocab]
|
||||
+cell #[code Vocab]
|
||||
+cell The vocabulary. Must be shared with documents to be processed.
|
||||
|
||||
+row
|
||||
+cell #[code model]
|
||||
+cell #[code thinc.linear.AveragedPerceptron]
|
||||
+cell The statistical model.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell #[code EntityRecognizer]
|
||||
+cell The newly constructed object.
|
||||
|
||||
+h(2, "call") EntityRecognizer.__call__
|
||||
+tag method
|
||||
|
||||
p Apply the entity recognizer, setting the NER tags onto the #[code Doc] object.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The document to be processed.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell #[code None]
|
||||
+cell -
|
||||
|
||||
+h(2, "pipe") EntityRecognizer.pipe
|
||||
+tag method
|
||||
|
||||
p Process a stream of documents.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code stream]
|
||||
+cell -
|
||||
+cell The sequence of documents to process.
|
||||
|
||||
+row
|
||||
+cell #[code batch_size]
|
||||
+cell int
|
||||
+cell The number of documents to accumulate into a working set.
|
||||
|
||||
+row
|
||||
+cell #[code n_threads]
|
||||
+cell int
|
||||
+cell
|
||||
| The number of threads with which to work on the buffer in
|
||||
| parallel.
|
||||
|
||||
+footrow
|
||||
+cell yields
|
||||
+cell #[code Doc]
|
||||
+cell Documents, in order.
|
||||
|
||||
+h(2, "update") EntityRecognizer.update
|
||||
+tag method
|
||||
|
||||
p Update the statistical model.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The example document for the update.
|
||||
|
||||
+row
|
||||
+cell #[code gold]
|
||||
+cell #[code GoldParse]
|
||||
+cell The gold-standard annotations, to calculate the loss.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The loss on this example.
|
||||
|
||||
+h(2, "step_through") EntityRecognizer.step_through
|
||||
+tag method
|
||||
|
||||
p Set up a stepwise state, to introspect and control the transition sequence.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The document to step through.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell #[code StepwiseState]
|
||||
+cell A state object, to step through the annotation process.
|
|
@ -1,241 +0,0 @@
|
|||
//- 💫 DOCS > API > FACTS & FIGURES
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
+under-construction
|
||||
|
||||
+h(2, "comparison") Feature comparison
|
||||
|
||||
p
|
||||
| Here's a quick comparison of the functionalities offered by spaCy,
|
||||
| #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") SyntaxNet],
|
||||
| #[+a("http://www.nltk.org/py-modindex.html") NLTK] and
|
||||
| #[+a("http://stanfordnlp.github.io/CoreNLP/") CoreNLP].
|
||||
|
||||
+table([ "", "spaCy", "SyntaxNet", "NLTK", "CoreNLP"])
|
||||
+row
|
||||
+cell Easy installation
|
||||
each icon in [ "pro", "con", "pro", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Python API
|
||||
each icon in [ "pro", "con", "pro", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Multi-language support
|
||||
each icon in [ "neutral", "pro", "pro", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Tokenization
|
||||
each icon in [ "pro", "pro", "pro", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Part-of-speech tagging
|
||||
each icon in [ "pro", "pro", "pro", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Sentence segmentation
|
||||
each icon in [ "pro", "pro", "pro", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Dependency parsing
|
||||
each icon in [ "pro", "pro", "con", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Entity Recognition
|
||||
each icon in [ "pro", "con", "pro", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Integrated word vectors
|
||||
each icon in [ "pro", "con", "con", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Sentiment analysis
|
||||
each icon in [ "pro", "con", "pro", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Coreference resolution
|
||||
each icon in [ "con", "con", "con", "pro" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+h(2, "benchmarks") Benchmarks
|
||||
|
||||
p
|
||||
| Two peer-reviewed papers in 2015 confirm that spaCy offers the
|
||||
| #[strong fastest syntactic parser in the world] and that
|
||||
| #[strong its accuracy is within 1% of the best] available. The few
|
||||
| systems that are more accurate are 20× slower or more.
|
||||
|
||||
+aside("About the evaluation")
|
||||
| The first of the evaluations was published by #[strong Yahoo! Labs] and
|
||||
| #[strong Emory University], as part of a survey of current parsing
|
||||
| technologies #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") (Choi et al., 2015)].
|
||||
| Their results and subsequent discussions helped us develop a novel
|
||||
| psychologically-motivated technique to improve spaCy's accuracy, which
|
||||
| we published in joint work with Macquarie University
|
||||
| #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].
|
||||
|
||||
+table([ "System", "Language", "Accuracy", "Speed (wps)"])
|
||||
+row
|
||||
each data in [ "spaCy", "Cython", "91.8", "13,963" ]
|
||||
+cell #[strong=data]
|
||||
+row
|
||||
each data in [ "ClearNLP", "Java", "91.7", "10,271" ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
each data in [ "CoreNLP", "Java", "89.6", "8,602"]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
each data in [ "MATE", "Java", "92.5", "550"]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
each data in [ "Turbo", "C++", "92.4", "349" ]
|
||||
+cell=data
|
||||
|
||||
+h(3, "parse-accuracy") Parse accuracy
|
||||
|
||||
p
|
||||
| In 2016, Google released their
|
||||
| #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") SyntaxNet]
|
||||
| library, setting a new state of the art for syntactic dependency parsing
|
||||
| accuracy. SyntaxNet's algorithm is very similar to spaCy's. The main
|
||||
| difference is that SyntaxNet uses a neural network while spaCy uses a
|
||||
| sparse linear model.
|
||||
|
||||
+aside("Methodology")
|
||||
| #[+a("http://arxiv.org/abs/1603.06042") Andor et al. (2016)] chose
|
||||
| slightly different experimental conditions from
|
||||
| #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") Choi et al. (2015)],
|
||||
| so the two accuracy tables here do not present directly comparable
|
||||
| figures. We have only evaluated spaCy in the "News" condition following
|
||||
| the SyntaxNet methodology. We don't yet have benchmark figures for the
|
||||
| "Web" and "Questions" conditions.
|
||||
|
||||
+table([ "System", "News", "Web", "Questions" ])
|
||||
+row
|
||||
+cell spaCy
|
||||
each data in [ 92.8, "n/a", "n/a" ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
+cell #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") Parsey McParseface]
|
||||
each data in [ 94.15, 89.08, 94.77 ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
+cell #[+a("http://www.cs.cmu.edu/~ark/TurboParser/") Martins et al. (2013)]
|
||||
each data in [ 93.10, 88.23, 94.21 ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
+cell #[+a("http://research.google.com/pubs/archive/38148.pdf") Zhang and McDonald (2014)]
|
||||
each data in [ 93.32, 88.65, 93.37 ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
+cell #[+a("http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43800.pdf") Weiss et al. (2015)]
|
||||
each data in [ 93.91, 89.29, 94.17 ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
+cell #[strong #[+a("http://arxiv.org/abs/1603.06042") Andor et al. (2016)]]
|
||||
each data in [ 94.44, 90.17, 95.40 ]
|
||||
+cell #[strong=data]
|
||||
|
||||
+h(3, "speed-comparison") Detailed speed comparison
|
||||
|
||||
p
|
||||
| Here we compare the per-document processing time of various spaCy
|
||||
| functionalities against other NLP libraries. We show both absolute
|
||||
| timings (in ms) and relative performance (normalized to spaCy). Lower is
|
||||
| better.
|
||||
|
||||
+aside("Methodology")
|
||||
| #[strong Set up:] 100,000 plain-text documents were streamed from an
|
||||
| SQLite3 database, and processed with an NLP library, to one of three
|
||||
| levels of detail — tokenization, tagging, or parsing. The tasks are
|
||||
| additive: to parse the text you have to tokenize and tag it. The
|
||||
| pre-processing was not subtracted from the times — I report the time
|
||||
| required for the pipeline to complete. I report mean times per document,
|
||||
| in milliseconds.#[br]#[br]
|
||||
| #[strong Hardware]: Intel i7-3770 (2012)#[br]
|
||||
| #[strong Implementation]: #[+src(gh("spacy-benchmarks")) spacy-benchmarks]
|
||||
|
||||
+table
|
||||
+row.u-text-label.u-text-center
|
||||
th.c-table__head-cell
|
||||
th.c-table__head-cell(colspan="3") Absolute (ms per doc)
|
||||
th.c-table__head-cell(colspan="3") Relative (to spaCy)
|
||||
|
||||
+row
|
||||
each column in ["System", "Tokenize", "Tag", "Parse", "Tokenize", "Tag", "Parse"]
|
||||
th.c-table__head-cell.u-text-label=column
|
||||
|
||||
+row
|
||||
+cell #[strong spaCy]
|
||||
each data in [ "0.2ms", "1ms", "19ms"]
|
||||
+cell #[strong=data]
|
||||
|
||||
each data in [ "1x", "1x", "1x" ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
each data in [ "CoreNLP", "2ms", "10ms", "49ms", "10x", "10x", "2.6x"]
|
||||
+cell=data
|
||||
+row
|
||||
each data in [ "ZPar", "1ms", "8ms", "850ms", "5x", "8x", "44.7x" ]
|
||||
+cell=data
|
||||
+row
|
||||
each data in [ "NLTK", "4ms", "443ms", "n/a", "20x", "443x", "n/a" ]
|
||||
+cell=data
|
||||
|
||||
+h(3, "ner") Named entity comparison
|
||||
|
||||
p
|
||||
| #[+a("https://aclweb.org/anthology/W/W16/W16-2703.pdf") Jiang et al. (2016)]
|
||||
| present several detailed comparisons of the named entity recognition
|
||||
| models provided by spaCy, CoreNLP, NLTK and LingPipe. Here we show their
|
||||
| evaluation of person, location and organization accuracy on Wikipedia.
|
||||
|
||||
+aside("Methodology")
|
||||
| Making a meaningful comparison of different named entity recognition
|
||||
| systems is tricky. Systems are often trained on different data, which
|
||||
| usually have slight differences in annotation style. For instance, some
|
||||
| corpora include titles as part of person names, while others don't.
|
||||
| These trivial differences in convention can distort comparisons
|
||||
| significantly. Jiang et al.'s #[em partial overlap] metric goes a long
|
||||
| way to solving this problem.
|
||||
|
||||
+table([ "System", "Precision", "Recall", "F-measure" ])
|
||||
+row
|
||||
+cell spaCy
|
||||
each data in [ 0.7240, 0.6514, 0.6858 ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
+cell #[strong CoreNLP]
|
||||
each data in [ 0.7914, 0.7327, 0.7609 ]
|
||||
+cell #[strong=data]
|
||||
|
||||
+row
|
||||
+cell NLTK
|
||||
each data in [ 0.5136, 0.6532, 0.5750 ]
|
||||
+cell=data
|
||||
|
||||
+row
|
||||
+cell LingPipe
|
||||
each data in [ 0.5412, 0.5357, 0.5384 ]
|
||||
+cell=data
|
|
@ -1,93 +0,0 @@
|
|||
//- 💫 DOCS > API > LANGUAGE MODELS
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p
|
||||
| spaCy currently provides models for the following languages and
|
||||
| capabilities:
|
||||
|
||||
|
||||
+aside-code("Download language models", "bash").
|
||||
spacy download en
|
||||
spacy download de
|
||||
spacy download fr
|
||||
|
||||
+table([ "Language", "Token", "SBD", "Lemma", "POS", "NER", "Dep", "Vector", "Sentiment"])
|
||||
+row
|
||||
+cell English #[code en]
|
||||
each icon in [ "pro", "pro", "pro", "pro", "pro", "pro", "pro", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell German #[code de]
|
||||
each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell French #[code fr]
|
||||
each icon in [ "pro", "con", "con", "pro", "con", "pro", "pro", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Spanish #[code es]
|
||||
each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
p
|
||||
+button("/docs/usage/models", true, "primary") See available models
|
||||
|
||||
+h(2, "alpha-support") Alpha tokenization support
|
||||
|
||||
p
|
||||
| Work has started on the following languages. You can help by
|
||||
| #[+a("/docs/usage/adding-languages#language-data") improving the existing language data]
|
||||
| and extending the tokenization patterns.
|
||||
|
||||
+aside("Usage note")
|
||||
| Note that the alpha languages don't yet come with a language model. In
|
||||
| order to use them, you have to import them directly:
|
||||
|
||||
+code.o-no-block.
|
||||
from spacy.lang.fi import Finnish
|
||||
nlp = Finnish()
|
||||
doc = nlp(u'Ilmatyynyalukseni on täynnä ankeriaita')
|
||||
|
||||
+infobox("Dependencies")
|
||||
| Some language tokenizers require external dependencies. To use #[strong Chinese],
|
||||
| you need to have #[+a("https://github.com/fxsjy/jieba") Jieba] installed.
|
||||
| The #[strong Japanese] tokenizer requires
|
||||
| #[+a("https://github.com/mocobeta/janome") Janome].
|
||||
|
||||
+table([ "Language", "Code", "Source" ])
|
||||
each language, code in { it: "Italian", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", nb: "Norwegian Bokmål", da: "Danish", hu: "Hungarian", pl: "Polish", bn: "Bengali", he: "Hebrew", zh: "Chinese", ja: "Japanese" }
|
||||
+row
|
||||
+cell #{language}
|
||||
+cell #[code=code]
|
||||
+cell
|
||||
+src(gh("spaCy", "spacy/lang/" + code)) lang/#{code}
|
||||
|
||||
+h(2, "multi-language") Multi-language support
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| As of v2.0, spaCy supports models trained on more than one language. This
|
||||
| is especially useful for named entity recognition. The language ID used
|
||||
| for multi-language or language-neutral models is #[code xx]. The
|
||||
| language class, a generic subclass containing only the base language data,
|
||||
| can be found in #[+src(gh("spaCy", "spacy/lang/xx")) lang/xx].
|
||||
|
||||
p
|
||||
| To load your model with the neutral, multi-language class, simply set
|
||||
| #[code "language": "xx"] in your
|
||||
| #[+a("/docs/usage/saving-loading#models-generating") model package]'s
|
||||
| meta.json. You can also import the class directly, or call
|
||||
| #[+api("util#get_lang_class") #[code util.get_lang_class()]] for
|
||||
| lazy-loading.
|
||||
|
||||
+code("Standard import").
|
||||
from spacy.lang.xx import MultiLanguage
|
||||
nlp = MultiLanguage()
|
||||
|
||||
+code("With lazy-loading").
|
||||
from spacy.util import get_lang_class
|
||||
nlp = get_lang_class('xx')
|
|
@ -1,93 +0,0 @@
|
|||
//- 💫 DOCS > API > TAGGER
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p Annotate part-of-speech tags on #[code Doc] objects.
|
||||
|
||||
+under-construction
|
||||
|
||||
+h(2, "init") Tagger.__init__
|
||||
+tag method
|
||||
|
||||
p Create a #[code Tagger].
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code vocab]
|
||||
+cell #[code Vocab]
|
||||
+cell The vocabulary. Must be shared with documents to be processed.
|
||||
|
||||
+row
|
||||
+cell #[code model]
|
||||
+cell #[code thinc.linear.AveragedPerceptron]
|
||||
+cell The statistical model.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell #[code Tagger]
|
||||
+cell The newly constructed object.
|
||||
|
||||
+h(2, "call") Tagger.__call__
|
||||
+tag method
|
||||
|
||||
p Apply the tagger, setting the POS tags onto the #[code Doc] object.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The tokens to be tagged.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell #[code None]
|
||||
+cell -
|
||||
|
||||
+h(2, "pipe") Tagger.pipe
|
||||
+tag method
|
||||
|
||||
p Tag a stream of documents.
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code stream]
|
||||
+cell -
|
||||
+cell The sequence of documents to tag.
|
||||
|
||||
+row
|
||||
+cell #[code batch_size]
|
||||
+cell int
|
||||
+cell The number of documents to accumulate into a working set.
|
||||
|
||||
+row
|
||||
+cell #[code n_threads]
|
||||
+cell int
|
||||
+cell
|
||||
| The number of threads with which to work on the buffer in
|
||||
| parallel.
|
||||
|
||||
+footrow
|
||||
+cell yields
|
||||
+cell #[code Doc]
|
||||
+cell Documents, in order.
|
||||
|
||||
+h(2, "update") Tagger.update
|
||||
+tag method
|
||||
|
||||
p Update the statistical model, with tags supplied for the given document.
|
||||
|
||||
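p
| A hypothetical sketch of a single update step, assuming a #[code Tagger]
| instance and a three-token #[code Doc] with gold tags:

+aside-code("Example").
from spacy.gold import GoldParse
gold = GoldParse(doc, tags=['DT', 'NN', 'VBZ'])
n_correct = tagger.update(doc, gold)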
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code doc]
|
||||
+cell #[code Doc]
|
||||
+cell The example document for the update.
|
||||
|
||||
+row
|
||||
+cell #[code gold]
|
||||
+cell #[code GoldParse]
|
||||
+cell Manager for the gold-standard tags.
|
||||
|
||||
+footrow
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell Number of tags predicted correctly.
|
|
@ -1,7 +0,0 @@
|
|||
//- 💫 DOCS > API > TENSORIZER
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p Add a tensor with position-sensitive meaning representations to a #[code Doc].
|
||||
|
||||
+under-construction
|
|
@ -1,21 +0,0 @@
|
|||
//- 💫 DOCS > API > TEXTCATEGORIZER
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p
|
||||
| Add text categorization models to spaCy pipelines. The model supports
|
||||
| classification with multiple, non-mutually exclusive labels.
|
||||
|
||||
p
|
||||
| You can change the model architecture rather easily, but by default, the
|
||||
| #[code TextCategorizer] class uses a convolutional neural network to
|
||||
| assign position-sensitive vectors to each word in the document. This step
|
||||
| is similar to the #[+api("tensorizer") #[code Tensorizer]] component, but the
|
||||
| #[code TextCategorizer] uses its own CNN model, to avoid sharing weights
|
||||
| with the other pipeline components. The document tensor is then
|
||||
| summarized by concatenating max and mean pooling, and a multilayer
|
||||
| perceptron is used to predict an output vector of length #[code nr_class],
|
||||
| before a logistic activation is applied elementwise. The value of each
|
||||
| output neuron is the probability that some class is present.
|
||||
|
||||
+under-construction
|
|
@ -1,7 +0,0 @@
|
|||
//- 💫 DOCS > API > VECTORS
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p A container class for vector data keyed by string.
|
||||
|
||||
+under-construction
|
72
website/usage/_models/_languages.jade
Normal file
72
website/usage/_models/_languages.jade
Normal file
|
@ -0,0 +1,72 @@
|
|||
//- 💫 DOCS > USAGE > MODELS > LANGUAGE SUPPORT
|
||||
|
||||
p spaCy currently provides models for the following languages:
|
||||
|
||||
+table(["Language", "Code", "Language data", "Models"])
|
||||
for models, code in MODELS
|
||||
- var count = Object.keys(models).length
|
||||
+row
|
||||
+cell=LANGUAGES[code]
|
||||
+cell #[code=code]
|
||||
+cell
|
||||
+src(gh("spaCy", "spacy/lang/" + code)) #[code lang/#{code}]
|
||||
+cell
|
||||
+a("/models/" + code) #{count} #{(count == 1) ? "model" : "models"}
|
||||
|
||||
+h(3, "alpha-support") Alpha tokenization support
|
||||
|
||||
p
|
||||
| Work has started on the following languages. You can help by
|
||||
| #[+a("/usage/adding-languages#language-data") improving the existing language data]
|
||||
| and extending the tokenization patterns.
|
||||
|
||||
+aside("Usage note")
|
||||
| Note that the alpha languages don't yet come with a language model. In
|
||||
| order to use them, you have to import them directly, or use
|
||||
| #[+api("spacy#blank") #[code spacy.blank]]:
|
||||
|
||||
+code.o-no-block.
|
||||
from spacy.lang.fi import Finnish
|
||||
nlp = Finnish() # use directly
|
||||
nlp = spacy.blank('fi') # blank instance
|
||||
|
||||
+table(["Language", "Code", "Language data"])
|
||||
for lang, code in LANGUAGES
|
||||
if !Object.keys(MODELS).includes(code)
|
||||
+row
|
||||
+cell #{LANGUAGES[code]}
|
||||
+cell #[code=code]
|
||||
+cell
|
||||
+src(gh("spaCy", "spacy/lang/" + code)) #[code lang/#{code}]
|
||||
|
||||
+infobox("Dependencies")
|
||||
| Some language tokenizers require external dependencies. To use #[strong Chinese],
|
||||
| you need to have #[+a("https://github.com/fxsjy/jieba") Jieba] installed.
|
||||
| The #[strong Japanese] tokenizer requires
|
||||
| #[+a("https://github.com/mocobeta/janome") Janome].
|
||||
|
||||
+h(3, "multi-language") Multi-language support
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| As of v2.0, spaCy supports models trained on more than one language. This
|
||||
| is especially useful for named entity recognition. The language ID used
|
||||
| for multi-language or language-neutral models is #[code xx]. The
|
||||
| language class, a generic subclass containing only the base language data,
|
||||
| can be found in #[+src(gh("spaCy", "spacy/lang/xx")) #[code lang/xx]].
|
||||
|
||||
p
|
||||
| To load your model with the neutral, multi-language class, simply set
|
||||
| #[code "language": "xx"] in your
|
||||
| #[+a("/usage/training#models-generating") model package]'s
|
||||
| meta.json. You can also import the class directly, or call
|
||||
| #[+api("util#get_lang_class") #[code util.get_lang_class()]] for
|
||||
| lazy-loading.
|
||||
|
||||
+code("Standard import").
|
||||
from spacy.lang.xx import MultiLanguage
|
||||
nlp = MultiLanguage()
|
||||
|
||||
+code("With lazy-loading").
|
||||
from spacy.util import get_lang_class
|
||||
nlp = get_lang_class('xx')
|