Update API documentation

This commit is contained in:
ines 2017-10-03 14:27:22 +02:00
parent 3f4fd2c5d5
commit 808f7ee417
46 changed files with 2070 additions and 1141 deletions

View File

@ -0,0 +1,43 @@
//- 💫 DOCS > API > ANNOTATION > BILUO
+table([ "Tag", "Description" ])
+row
+cell #[code #[span.u-color-theme B] EGIN]
+cell The first token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme I] N]
+cell An inner token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme L] AST]
+cell The final token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme U] NIT]
+cell A single-token entity.
+row
+cell #[code #[span.u-color-theme O] UT]
+cell A non-entity token.
+aside("Why BILUO, not IOB?")
| There are several coding schemes for encoding entity annotations as
| token tags. These coding schemes are equally expressive, but not
| necessarily equally learnable.
| #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
| showed that the minimal #[strong Begin], #[strong In], #[strong Out]
| scheme was more difficult to learn than the #[strong BILUO] scheme that
| we use, which explicitly marks boundary tokens.
p
| spaCy translates the character offsets into this scheme, in order to
| decide the cost of each action given the current state of the entity
| recogniser. The costs are then used to calculate the gradient of the
| loss, to train the model. The exact algorithm is a pastiche of
| well-known methods, and is not currently described in any single
| publication. The model is a greedy transition-based parser guided by a
| linear model whose weights are learned using the averaged perceptron
| loss, via the #[+a("http://www.aclweb.org/anthology/C12-1059") dynamic oracle]
| imitation learning strategy. The transition system is equivalent to the
| BILUO tagging scheme.
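p
| As a concrete illustration, the conversion from character offsets to
| BILUO tags is exposed as the #[code biluo_tags_from_offsets] helper in
| #[code spacy.gold] (see the #[code GoldParse] docs). A minimal sketch,
| assuming #[code nlp] is a loaded pipeline:
+code.
from spacy.gold import biluo_tags_from_offsets
doc = nlp(u"I like London.")
entities = [(7, 13, 'LOC')]  # character offsets for "London"
tags = biluo_tags_from_offsets(doc, entities)
# tags == ['O', 'O', 'U-LOC', 'O']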

View File

@ -0,0 +1,115 @@
//- 💫 DOCS > API > ARCHITECTURE > CYTHON
+aside("What's Cython?")
| #[+a("http://cython.org/") Cython] is a language for writing
| C extensions for Python. Most Python code is also valid Cython, but
| you can add type declarations to get efficient memory-managed code
| just like C or C++.
p
| spaCy's core data structures are implemented as
| #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
| managed through the #[+a(gh("cymem")) #[code cymem]]
| #[code cymem.Pool] class, which allows you
| to allocate memory which will be freed when the #[code Pool] object
| is garbage collected. This means you usually don't have to worry
| about freeing memory. You just have to decide which Python object
| owns the memory, and make it own the #[code Pool]. When that object
| goes out of scope, the memory will be freed. You do have to take
| care that no pointers outlive the object that owns them — but this
| is generally quite easy.
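p
| As a rough sketch of this ownership pattern (the class and field names
| here are illustrative, not actual spaCy internals):
+code.
from cymem.cymem cimport Pool

cdef class Counts:
    cdef Pool mem
    cdef int* counts
    def __init__(self, int size):
        # the Counts object owns the Pool; when it's garbage collected,
        # everything allocated from the Pool is freed with it
        self.mem = Pool()
        self.counts = <int*>self.mem.alloc(size, sizeof(int))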
p
| All Cython modules should have the #[code # cython: infer_types=True]
| compiler directive at the top of the file. This makes the code much
| cleaner, as it avoids the need for many type declarations. If
| possible, you should prefer to declare your functions #[code nogil],
| even if you don't especially care about multi-threading. The reason
| is that #[code nogil] functions help the Cython compiler reason about
| your code quite a lot — you're telling the compiler that no Python
| dynamics are possible. This lets the compiler catch many more errors,
| and ensures your function will run at C speed.
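p
| For example (an illustrative sketch, not code taken from spaCy itself):
+code.
# cython: infer_types=True

cdef int sum_to(int n) nogil:
    # no Python objects are created or touched here, so the function can
    # be declared nogil; with infer_types=True, `total` and `i` are
    # inferred as C integers without explicit declarations
    total = 0
    for i in range(n):
        total += i
    return total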
p
| Cython gives you many choices of sequences: you could have a Python
| list, a numpy array, a memory view, a C++ vector, or a pointer.
| Pointers are preferred, because they are fastest, have the most
| explicit semantics, and let the compiler check your code more
| strictly. C++ vectors are also great — but you should only use them
| internally in functions. It's less friendly to accept a vector as an
| argument, because that asks the user to do much more work. Here's
| how to get a pointer from a numpy array, memory view or vector:
+code.
cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
    pointer1 = <int*>numpy_array.data
    pointer2 = cpp_vector.data()
    pointer3 = &memory_view[0]
p
| Both C arrays and C++ vectors reassure the compiler that no Python
| operations are possible on your variable. This is a big advantage:
| it lets the Cython compiler raise many more errors for you.
p
| When getting a pointer from a numpy array or memoryview, take care
| that the data is actually stored in C-contiguous order — otherwise
| you'll get a pointer to nonsense. The type-declarations in the code
| above should generate runtime errors if buffers with incorrect
| memory layouts are passed in. To iterate over the array, the
| following style is preferred:
+code.
cdef int c_total(const int* int_array, int length) nogil:
    total = 0
    for item in int_array[:length]:
        total += item
    return total
p
| If this is confusing, consider that the compiler couldn't deal with
| #[code for item in int_array:] — there's no length attached to a raw
| pointer, so how could we figure out where to stop? The length is
| provided in the slice notation as a solution to this. Note that we
| don't have to declare the type of #[code item] in the code above —
| the compiler can easily infer it. This gives us tidy code that looks
| quite like Python, but is exactly as fast as C — because we've made
| sure the compilation to C is trivial.
p
| Your functions cannot be declared #[code nogil] if they need to
| create Python objects or call Python functions. This is perfectly
| okay — you shouldn't torture your code just to get #[code nogil]
| functions. However, if your function isn't #[code nogil], you should
| compile your module with #[code cython -a --cplus my_module.pyx] and
| open the resulting #[code my_module.html] file in a browser. This
| will let you see how Cython is compiling your code. Calls into the
| Python run-time will be in bright yellow. This lets you easily see
| whether Cython is able to correctly type your code, or whether there
| are unexpected problems.
p
| Working in Cython is very rewarding once you're over the initial
| learning curve. As with C and C++, the first way you write something
| in Cython will often be the performance-optimal approach. In
| contrast, Python optimisation generally requires a lot of
| experimentation. Is it faster to have an #[code if item in my_dict]
| check, or to use #[code .get()]? What about
| #[code try]/#[code except]? Does this numpy operation create a copy?
| There's no way to guess the answers to these questions, and you'll
| usually be dissatisfied with your results — so there's no way to
| know when to stop this process. In the worst case, you'll make a
| mess that invites the next reader to try their luck too. This is
| like one of those
| #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
| where the rescuers keep passing out from low oxygen, causing
| another rescuer to follow — only to succumb themselves. In short,
| just say no to optimizing your Python. If it's not fast enough the
| first time, just switch to Cython.
+infobox("Resources")
+list.o-no-block
+item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
+item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
+item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCys parser and named entity recogniser] (explosion.ai)

View File

@ -0,0 +1,141 @@
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
p
| The parsing model is a blend of recent results. The two recent
| inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg at
| Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
| the parser is still based on the work of Joakim Nivre#[+fn(2)], who
| introduced the transition-based framework#[+fn(3)], the arc-eager
| transition system, and the imitation learning objective. The model is
| implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
| library. We first predict context-sensitive vectors for each word in the
| input:
+code.
(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
p
| This convolutional layer is shared between the tagger, parser and NER,
| and will also be shared by the future neural lemmatizer. Because the
| parser shares these layers with the tagger, the parser does not require
| tag features. I got this trick from David Weiss's "Stack-propagation"
| paper#[+fn(4)].
p
| To boost the representation, the tagger actually predicts a "super tag"
| with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
| these supertags by adding a softmax layer onto the convolutional layer,
| so we're teaching the convolutional layer to give us a representation
| that's one affine transform away from this informative lexical information.
| This is obviously good for the parser (which backprops to the
| convolutions too). The parser model makes a state vector by concatenating
| the vector representations for its context tokens. The current context
| tokens are:
+table
+row
+cell #[code S0], #[code S1], #[code S2]
+cell Top three words on the stack.
+row
+cell #[code B0], #[code B1]
+cell First two words of the buffer.
+row
+cell.u-nowrap
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
| #[code B1L1]#[br]
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
| #[code B1L2]
+cell
| Leftmost and second leftmost children of #[code S0], #[code S1],
| #[code S2], #[code B0] and #[code B1].
+row
+cell.u-nowrap
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
| #[code B1R1]#[br]
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
| #[code B1R2]
+cell
| Rightmost and second rightmost children of #[code S0], #[code S1],
| #[code S2], #[code B0] and #[code B1].
p
| This makes the state vector quite long: #[code 13*T], where #[code T] is
| the token vector width (128 is working well). Fortunately, there's a way
| to structure the computation to save some expense (and make it more
| GPU-friendly).
p
| The parser typically visits #[code 2*N] states for a sentence of length
| #[code N] (although it may visit more, if it back-tracks with a
| non-monotonic transition#[+fn(6)]). A naive implementation would require
| #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
| size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
| multiplication, to pre-compute the hidden weights for each positional
| feature with respect to the words in the batch. (Note that our token
| vectors come from the CNN — so we can't play this trick over the
| vocabulary. That's how Stanford's NN parser#[+fn(7)] works — and why its
| model is so big.)
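p
| In rough numpy terms, the idea looks something like this (an
| illustrative sketch of the trick, not spaCy's actual implementation;
| shapes and names are made up):
+code.
import numpy

B, N, T, H = 8, 20, 128, 64  # batch size, sentence length, token width, hidden width
tokens = numpy.random.randn(B * N, T)  # context-sensitive token vectors from the CNN
W = numpy.random.randn(T, 13 * H)      # hidden weights, one (T, H) block per feature slot

# one big matrix multiplication up front: each token's contribution to
# each of the 13 positional feature slots
precomputed = tokens.dot(W).reshape(B * N, 13, H)

def state_hidden(feature_tokens):
    # feature_tokens: the 13 token indices (S0, S1, ... B1R2) for one state.
    # Instead of a (13*T,) @ (13*T, H) multiplication per state, we just
    # gather and sum the precomputed rows.
    return sum(precomputed[tok, slot] for slot, tok in enumerate(feature_tokens))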
p
| This pre-computation strategy allows a nice compromise between
| GPU-friendliness and implementation simplicity. The CNN and the wide
| lower layer are computed on the GPU, and then the precomputed hidden
| weights are moved to the CPU, before we start the transition-based
| parsing process. This makes a lot of things much easier. We don't have to
| worry about variable-length batch sizes, and we don't have to implement
| the dynamic oracle in CUDA to train.
p
| Currently the parser's loss function is multilabel log loss#[+fn(6)], as
| the dynamic oracle allows multiple states to be 0 cost. This is defined
| as follows, where #[code gZ] is the sum of the scores assigned to gold
| classes:
+code.
(exp(score) / Z) - (exp(score) / gZ)
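p
| A small numpy sketch of that gradient (illustrative only; it assumes the
| second term is applied just to the zero-cost, i.e. gold, classes):
+code.
import numpy

def d_loss(scores, is_gold):
    # scores: model scores for each action; is_gold: 1.0 for zero-cost actions
    exp_scores = numpy.exp(scores - scores.max())
    Z = exp_scores.sum()               # partition over all classes
    gZ = (exp_scores * is_gold).sum()  # partition over gold (zero-cost) classes
    return exp_scores / Z - is_gold * exp_scores / gZ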
+bibliography
+item
| #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
br
| Eliyahu Kiperwasser, Yoav Goldberg. (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
br
| Yoav Goldberg, Joakim Nivre (2012)
+item
| #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
br
| Matthew Honnibal (2013)
+item
| #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
br
| Yuan Zhang, David Weiss (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
br
| Anders Søgaard, Yoav Goldberg (2016)
+item
| #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
br
| Matthew Honnibal, Mark Johnson (2015)
+item
| #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
br
| Danqi Chen, Christopher D. Manning (2014)
+item
| #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
br
| Stefan Riezler et al. (2002)

View File

@ -1,29 +1,32 @@
{
"sidebar": {
"Introduction": {
"Facts & Figures": "./",
"Languages": "language-models",
"Annotation Specs": "annotation"
"Overview": {
"Architecture": "./",
"Annotation Specs": "annotation",
"Functions": "top-level"
},
"Top-level": {
"spacy": "spacy",
"displacy": "displacy",
"Utility Functions": "util",
"Command line": "cli"
},
"Classes": {
"Containers": {
"Doc": "doc",
"Token": "token",
"Span": "span",
"Lexeme": "lexeme"
},
"Pipeline": {
"Language": "language",
"Tokenizer": "tokenizer",
"Pipe": "pipe",
"Tensorizer": "tensorizer",
"Tagger": "tagger",
"DependencyParser": "dependencyparser",
"EntityRecognizer": "entityrecognizer",
"TextCategorizer": "textcategorizer",
"Tokenizer": "tokenizer",
"Lemmatizer": "lemmatizer",
"Matcher": "matcher",
"Lexeme": "lexeme",
"PhraseMatcher": "phrasematcher"
},
"Other": {
"Vocab": "vocab",
"StringStore": "stringstore",
"Vectors": "vectors",
@ -34,52 +37,37 @@
},
"index": {
"title": "Facts & Figures",
"next": "language-models"
"title": "Architecture",
"next": "annotation",
"menu": {
"Basics": "basics",
"Neural Network Model": "nn-model",
"Cython Conventions": "cython"
}
},
"language-models": {
"title": "Languages",
"next": "philosophy"
},
"philosophy": {
"title": "Philosophy"
},
"spacy": {
"title": "spaCy top-level functions",
"source": "spacy/__init__.py",
"next": "displacy"
},
"displacy": {
"title": "displaCy",
"tag": "module",
"source": "spacy/displacy",
"next": "util"
},
"util": {
"title": "Utility Functions",
"source": "spacy/util.py",
"next": "cli"
},
"cli": {
"title": "Command Line Interface",
"source": "spacy/cli"
"top-level": {
"title": "Top-level Functions",
"menu": {
"spacy": "spacy",
"displacy": "displacy",
"Utility Functions": "util",
"Compatibility": "compat",
"Command Line": "cli"
}
},
"language": {
"title": "Language",
"tag": "class",
"teaser": "A text-processing pipeline.",
"source": "spacy/language.py"
},
"doc": {
"title": "Doc",
"tag": "class",
"teaser": "A container for accessing linguistic annotations.",
"source": "spacy/tokens/doc.pyx"
},
@ -103,6 +91,7 @@
"vocab": {
"title": "Vocab",
"teaser": "A storage class for vocabulary and other data shared across a language.",
"tag": "class",
"source": "spacy/vocab.pyx"
},
@ -115,10 +104,27 @@
"matcher": {
"title": "Matcher",
"teaser": "Match sequences of tokens, based on pattern rules.",
"tag": "class",
"source": "spacy/matcher.pyx"
},
"phrasematcher": {
"title": "PhraseMatcher",
"teaser": "Match sequences of tokens, based on documents.",
"tag": "class",
"tag_new": 2,
"source": "spacy/matcher.pyx"
},
"pipe": {
"title": "Pipe",
"teaser": "Abstract base class defining the API for pipeline components.",
"tag": "class",
"tag_new": 2,
"source": "spacy/pipeline.pyx"
},
"dependenyparser": {
"title": "DependencyParser",
"tag": "class",
@ -127,18 +133,22 @@
"entityrecognizer": {
"title": "EntityRecognizer",
"teaser": "Annotate named entities on documents.",
"tag": "class",
"source": "spacy/pipeline.pyx"
},
"textcategorizer": {
"title": "TextCategorizer",
"teaser": "Add text categorization models to spaCy pipelines.",
"tag": "class",
"tag_new": 2,
"source": "spacy/pipeline.pyx"
},
"dependencyparser": {
"title": "DependencyParser",
"teaser": "Annotate syntactic dependencies on documents.",
"tag": "class",
"source": "spacy/pipeline.pyx"
},
@ -149,15 +159,23 @@
"source": "spacy/tokenizer.pyx"
},
"lemmatizer": {
"title": "Lemmatizer",
"tag": "class"
},
"tagger": {
"title": "Tagger",
"teaser": "Annotate part-of-speech tags on documents.",
"tag": "class",
"source": "spacy/pipeline.pyx"
},
"tensorizer": {
"title": "Tensorizer",
"teaser": "Add a tensor with position-sensitive meaning representations to a document.",
"tag": "class",
"tag_new": 2,
"source": "spacy/pipeline.pyx"
},
@ -169,23 +187,38 @@
"goldcorpus": {
"title": "GoldCorpus",
"teaser": "An annotated corpus, using the JSON file format.",
"tag": "class",
"tag_new": 2,
"source": "spacy/gold.pyx"
},
"binder": {
"title": "Binder",
"tag": "class",
"tag_new": 2,
"source": "spacy/tokens/binder.pyx"
},
"vectors": {
"title": "Vectors",
"teaser": "Store, save and load word vectors.",
"tag": "class",
"tag_new": 2,
"source": "spacy/vectors.pyx"
},
"annotation": {
"title": "Annotation Specifications"
"title": "Annotation Specifications",
"teaser": "Schemes used for labels, tags and training data.",
"menu": {
"Tokenization": "tokenization",
"Sentence Boundaries": "sbd",
"POS Tagging": "pos-tagging",
"Lemmatization": "lemmatization",
"Dependencies": "dependency-parsing",
"Named Entities": "named-entities",
"Training Data": "training"
}
}
}

View File

@ -1,26 +1,17 @@
//- 💫 DOCS > USAGE > COMMAND LINE INTERFACE
include ../../_includes/_mixins
//- 💫 DOCS > API > TOP-LEVEL > COMMAND LINE INTERFACE
p
| As of v1.7.0, spaCy comes with new command line helpers to download and
| link models and show useful debugging information. For a list of available
| commands, type #[code spacy --help].
+infobox("⚠️ Deprecation note")
| As of spaCy 2.0, the #[code model] command to initialise a model data
| directory is deprecated. The command was only necessary because previous
| versions of spaCy expected a model directory to already be set up. This
| has since been changed, so you can use the #[+api("cli#train") #[code train]]
| command straight away.
+h(2, "download") Download
+h(3, "download") Download
p
| Download #[+a("/docs/usage/models") models] for spaCy. The downloader finds the
| Download #[+a("/usage/models") models] for spaCy. The downloader finds the
| best-matching compatible version, uses pip to download the model as a
| package and automatically creates a
| #[+a("/docs/usage/models#usage") shortcut link] to load the model by name.
| #[+a("/usage/models#usage") shortcut link] to load the model by name.
| Direct downloads don't perform any compatibility checks and require the
| model name to be specified with its version (e.g., #[code en_core_web_sm-1.2.0]).
@ -49,15 +40,15 @@ p
| detailed messages in case things go wrong. It's #[strong not recommended]
| to use this command as part of an automated process. If you know which
| model your project needs, you should consider a
| #[+a("/docs/usage/models#download-pip") direct download via pip], or
| #[+a("/usage/models#download-pip") direct download via pip], or
| uploading the model to a local PyPI installation and fetching it straight
| from there. This will also allow you to add it as a versioned package
| dependency to your project.
+h(2, "link") Link
+h(3, "link") Link
p
| Create a #[+a("/docs/usage/models#usage") shortcut link] for a model,
| Create a #[+a("/usage/models#usage") shortcut link] for a model,
| either a Python package or a local directory. This will let you load
| models from any location using a custom name via
| #[+api("spacy#load") #[code spacy.load()]].
@ -95,7 +86,7 @@ p
+cell flag
+cell Show help message and available arguments.
+h(2, "info") Info
+h(3, "info") Info
p
| Print information about your spaCy installation, models and local setup,
@ -122,15 +113,15 @@ p
+cell flag
+cell Show help message and available arguments.
+h(2, "convert") Convert
+h(3, "convert") Convert
p
| Convert files into spaCy's #[+a("/docs/api/annotation#json-input") JSON format]
| Convert files into spaCy's #[+a("/api/annotation#json-input") JSON format]
| for use with the #[code train] command and other experiment management
| functions. The right converter is chosen based on the file extension of
| the input file. Currently only supports #[code .conllu].
+code(false, "bash", "$").
+code(false, "bash", "$", false, false, true).
spacy convert [input_file] [output_dir] [--n-sents] [--morphology]
+table(["Argument", "Type", "Description"])
@ -159,14 +150,18 @@ p
+cell flag
+cell Show help message and available arguments.
+h(2, "train") Train
+h(3, "train") Train
p
| Train a model. Expects data in spaCy's
| #[+a("/docs/api/annotation#json-input") JSON format].
| #[+a("/api/annotation#json-input") JSON format]. On each epoch, a model
| will be saved out to the directory. Accuracy scores and model details
| will be added to a #[+a("/usage/training#models-generating") #[code meta.json]]
| to allow packaging the model using the
| #[+api("cli#package") #[code package]] command.
+code(false, "bash", "$").
spacy train [lang] [output_dir] [train_data] [dev_data] [--n-iter] [--n-sents] [--use-gpu] [--no-tagger] [--no-parser] [--no-entities]
+code(false, "bash", "$", false, false, true).
spacy train [lang] [output_dir] [train_data] [dev_data] [--n-iter] [--n-sents] [--use-gpu] [--meta-path] [--vectors] [--no-tagger] [--no-parser] [--no-entities] [--gold-preproc]
+table(["Argument", "Type", "Description"])
+row
@ -204,6 +199,27 @@ p
+cell option
+cell Use GPU.
+row
+cell #[code --vectors], #[code -v]
+cell option
+cell Model to load vectors from.
+row
+cell #[code --meta-path], #[code -m]
+cell option
+cell
| #[+tag-new(2)] Optional path to model
| #[+a("/usage/training#models-generating") #[code meta.json]].
| All relevant properties like #[code lang], #[code pipeline] and
| #[code spacy_version] will be overwritten.
+row
+cell #[code --version], #[code -V]
+cell option
+cell
| Model version. Will be written out to the model's
| #[code meta.json] after training.
+row
+cell #[code --no-tagger], #[code -T]
+cell flag
@ -219,12 +235,18 @@ p
+cell flag
+cell Don't train NER.
+row
+cell #[code --gold-preproc], #[code -G]
+cell flag
+cell Use gold preprocessing.
+row
+cell #[code --help], #[code -h]
+cell flag
+cell Show help message and available arguments.
+h(3, "train-hyperparams") Environment variables for hyperparameters
+h(4, "train-hyperparams") Environment variables for hyperparameters
+tag-new(2)
p
| spaCy lets you set hyperparameters for training via environment variables.
@ -236,98 +258,96 @@ p
+code(false, "bash").
parser_hidden_depth=2 parser_maxout_pieces=1 train-parser
+under-construction
+table(["Name", "Description", "Default"])
+row
+cell #[code dropout_from]
+cell
+cell Initial dropout rate.
+cell #[code 0.2]
+row
+cell #[code dropout_to]
+cell
+cell Final dropout rate.
+cell #[code 0.2]
+row
+cell #[code dropout_decay]
+cell
+cell Rate of dropout change.
+cell #[code 0.0]
+row
+cell #[code batch_from]
+cell
+cell Initial batch size.
+cell #[code 1]
+row
+cell #[code batch_to]
+cell
+cell Final batch size.
+cell #[code 64]
+row
+cell #[code batch_compound]
+cell
+cell Rate of batch size acceleration.
+cell #[code 1.001]
+row
+cell #[code token_vector_width]
+cell
+cell Width of embedding tables and convolutional layers.
+cell #[code 128]
+row
+cell #[code embed_size]
+cell
+cell Number of rows in embedding tables.
+cell #[code 7500]
+row
+cell #[code parser_maxout_pieces]
+cell
+cell Number of pieces in the parser's and NER's first maxout layer.
+cell #[code 2]
+row
+cell #[code parser_hidden_depth]
+cell
+cell Number of hidden layers in the parser and NER.
+cell #[code 1]
+row
+cell #[code hidden_width]
+cell
+cell Size of the parser's and NER's hidden layers.
+cell #[code 128]
+row
+cell #[code learn_rate]
+cell
+cell Learning rate.
+cell #[code 0.001]
+row
+cell #[code optimizer_B1]
+cell
+cell Momentum for the Adam solver.
+cell #[code 0.9]
+row
+cell #[code optimizer_B2]
+cell
+cell Adagrad-momentum for the Adam solver.
+cell #[code 0.999]
+row
+cell #[code optimizer_eps]
+cell
+cell Epsilon value for the Adam solver.
+cell #[code 1e-08]
+row
+cell #[code L2_penalty]
+cell
+cell L2 regularisation penalty.
+cell #[code 1e-06]
+row
+cell #[code grad_norm_clip]
+cell
+cell Gradient L2 norm constraint.
+cell #[code 1.0]
+h(2, "package") Package
+h(3, "package") Package
p
| Generate a #[+a("/docs/usage/saving-loading#generating") model Python package]
| Generate a #[+a("/usage/training#models-generating") model Python package]
| from an existing model data directory. All data files are copied over.
| If the path to a meta.json is supplied, or a meta.json is found in the
| input directory, this file is used. Otherwise, the data can be entered
@ -336,8 +356,8 @@ p
| sure you're always using the latest versions. This means you need to be
| connected to the internet to use this command.
+code(false, "bash", "$").
spacy package [input_dir] [output_dir] [--meta] [--force]
+code(false, "bash", "$", false, false, true).
spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force]
+table(["Argument", "Type", "Description"])
+row
@ -353,14 +373,14 @@ p
+row
+cell #[code --meta-path], #[code -m]
+cell option
+cell Path to meta.json file (optional).
+cell #[+tag-new(2)] Path to meta.json file (optional).
+row
+cell #[code --create-meta], #[code -c]
+cell flag
+cell
| Create a meta.json file on the command line, even if one already
| exists in the directory.
| #[+tag-new(2)] Create a meta.json file on the command line, even
| if one already exists in the directory.
+row
+cell #[code --force], #[code -f]

View File

@ -0,0 +1,91 @@
//- 💫 DOCS > API > TOP-LEVEL > COMPATIBILITY
p
| All Python code is written in an
| #[strong intersection of Python 2 and Python 3]. This is easy in Cython,
| but somewhat ugly in Python. Logic that deals with Python or platform
| compatibility only lives in #[code spacy.compat]. To distinguish them from
| the builtin functions, replacement functions are suffixed with an
| underscore, e.g. #[code unicode_]. For specific checks, spaCy uses the
| #[code six] and #[code ftfy] packages.
+aside-code("Example").
from spacy.compat import unicode_, json_dumps
compatible_unicode = unicode_('hello world')
compatible_json = json_dumps({'key': 'value'})
+table(["Name", "Python 2", "Python 3"])
+row
+cell #[code compat.bytes_]
+cell #[code str]
+cell #[code bytes]
+row
+cell #[code compat.unicode_]
+cell #[code unicode]
+cell #[code str]
+row
+cell #[code compat.basestring_]
+cell #[code basestring]
+cell #[code str]
+row
+cell #[code compat.input_]
+cell #[code raw_input]
+cell #[code input]
+row
+cell #[code compat.json_dumps]
+cell #[code ujson.dumps] with #[code .decode('utf8')]
+cell #[code ujson.dumps]
+row
+cell #[code compat.path2str]
+cell #[code str(path)] with #[code .decode('utf8')]
+cell #[code str(path)]
+h(3, "is_config") compat.is_config
+tag function
p
| Check if a specific configuration of Python version and operating system
| matches the user's setup. Mostly used to display targeted error messages.
+aside-code("Example").
from spacy.compat import is_config
if is_config(python2=True, windows=True):
print("You are using Python 2 on Windows.")
+table(["Name", "Type", "Description"])
+row
+cell #[code python2]
+cell bool
+cell spaCy is executed with Python 2.x.
+row
+cell #[code python3]
+cell bool
+cell spaCy is executed with Python 3.x.
+row
+cell #[code windows]
+cell bool
+cell spaCy is executed on Windows.
+row
+cell #[code linux]
+cell bool
+cell spaCy is executed on Linux.
+row
+cell #[code osx]
+cell bool
+cell spaCy is executed on OS X or macOS.
+row("foot")
+cell returns
+cell bool
+cell Whether the specified configuration matches the user's platform.

View File

@ -1,14 +1,12 @@
//- 💫 DOCS > API > DISPLACY
include ../../_includes/_mixins
//- 💫 DOCS > API > TOP-LEVEL > DISPLACY
p
| As of v2.0, spaCy comes with a built-in visualization suite. For more
| info and examples, see the usage guide on
| #[+a("/docs/usage/visualizers") visualizing spaCy].
| #[+a("/usage/visualizers") visualizing spaCy].
+h(2, "serve") displacy.serve
+h(3, "displacy.serve") displacy.serve
+tag method
+tag-new(2)
@ -60,7 +58,7 @@ p
+cell bool
+cell
| Don't parse #[code Doc] and instead, expect a dict or list of
| dicts. #[+a("/docs/usage/visualizers#manual-usage") See here]
| dicts. #[+a("/usage/visualizers#manual-usage") See here]
| for formats and examples.
+cell #[code False]
@ -70,7 +68,7 @@ p
+cell Port to serve visualization.
+cell #[code 5000]
+h(2, "render") displacy.render
+h(3, "displacy.render") displacy.render
+tag method
+tag-new(2)
@ -127,24 +125,24 @@ p Render a dependency parse tree or named entity visualization.
+cell bool
+cell
| Don't parse #[code Doc] and instead, expect a dict or list of
| dicts. #[+a("/docs/usage/visualizers#manual-usage") See here]
| dicts. #[+a("/usage/visualizers#manual-usage") See here]
| for formats and examples.
+cell #[code False]
+footrow
+row("foot")
+cell returns
+cell unicode
+cell Rendered HTML markup.
+cell
+h(2, "options") Visualizer options
+h(3, "displacy_options") Visualizer options
p
| The #[code options] argument lets you specify additional settings for
| each visualizer. If a setting is not present in the options, the default
| value will be used.
+h(3, "options-dep") Dependency Visualizer options
+h(4, "options-dep") Dependency Visualizer options
+aside-code("Example").
options = {'compact': True, 'color': 'blue'}
@ -219,7 +217,7 @@ p
+cell Distance between words in px.
+cell #[code 175] / #[code 85] (compact)
+h(3, "options-ent") Named Entity Visualizer options
+h(4, "displacy_options-ent") Named Entity Visualizer options
+aside-code("Example").
options = {'ents': ['PERSON', 'ORG', 'PRODUCT'],
@ -244,6 +242,6 @@ p
p
| By default, displaCy comes with colours for all
| #[+a("/docs/api/annotation#named-entities") entity types supported by spaCy].
| #[+a("/api/annotation#named-entities") entity types supported by spaCy].
| If you're using custom entity types, you can use the #[code colors]
| setting to add your own colours for them.

View File

@ -1,15 +1,13 @@
//- 💫 DOCS > API > SPACY
//- 💫 DOCS > API > TOP-LEVEL > SPACY
include ../../_includes/_mixins
+h(2, "load") spacy.load
+h(3, "spacy.load") spacy.load
+tag function
+tag-model
p
| Load a model via its #[+a("/docs/usage/models#usage") shortcut link],
| Load a model via its #[+a("/usage/models#usage") shortcut link],
| the name of an installed
| #[+a("/docs/usage/saving-loading#generating") model package], a unicode
| #[+a("/usage/training#models-generating") model package], a unicode
| path or a #[code Path]-like object. spaCy will try resolving the load
| argument in this order. If a model is loaded from a shortcut link or
| package name, spaCy will assume it's a Python package and import it and
@ -38,25 +36,57 @@ p
+cell list
+cell
| Names of pipeline components to
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
| #[+a("/usage/processing-pipelines#disabling") disable].
+footrow
+row("foot")
+cell returns
+cell #[code Language]
+cell A #[code Language] object with the loaded model.
+infobox("⚠️ Deprecation note")
+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy 2.0, the #[code path] keyword argument is deprecated. spaCy
| will also raise an error if no model could be loaded and never just
| return an empty #[code Language] object. If you need a blank language,
| you need to import it explicitly (#[code from spacy.lang.en import English])
| or use #[+api("util#get_lang_class") #[code util.get_lang_class]].
| you can use the new function #[+api("spacy#blank") #[code spacy.blank()]]
| or import the class explicitly, e.g.
| #[code from spacy.lang.en import English].
+code-new nlp = spacy.load('/model')
+code-old nlp = spacy.load('en', path='/model')
+h(2, "info") spacy.info
+h(3, "spacy.blank") spacy.blank
+tag function
+tag-new(2)
p
| Create a blank model of a given language class. This function is the
| twin of #[code spacy.load()].
+aside-code("Example").
nlp_en = spacy.blank('en')
nlp_de = spacy.blank('de')
+table(["Name", "Type", "Description"])
+row
+cell #[code name]
+cell unicode
+cell ISO code of the language class to load.
+row
+cell #[code disable]
+cell list
+cell
| Names of pipeline components to
| #[+a("/usage/processing-pipelines#disabling") disable].
+row("foot")
+cell returns
+cell #[code Language]
+cell An empty #[code Language] object of the appropriate subclass.
+h(4, "spacy.info") spacy.info
+tag function
p
@ -83,13 +113,13 @@ p
+cell Print information as Markdown.
+h(2, "explain") spacy.explain
+h(3, "spacy.explain") spacy.explain
+tag function
p
| Get a description for a given POS tag, dependency label or entity type.
| For a list of available terms, see
| #[+src(gh("spacy", "spacy/glossary.py")) glossary.py].
| #[+src(gh("spacy", "spacy/glossary.py")) #[code glossary.py]].
+aside-code("Example").
spacy.explain('NORP')
@ -107,18 +137,18 @@ p
+cell unicode
+cell Term to explain.
+footrow
+row("foot")
+cell returns
+cell unicode
+cell The explanation, or #[code None] if not found in the glossary.
+h(2, "set_factory") spacy.set_factory
+h(3, "spacy.set_factory") spacy.set_factory
+tag function
+tag-new(2)
p
| Set a factory that returns a custom
| #[+a("/docs/usage/language-processing-pipeline") processing pipeline]
| #[+a("/usage/processing-pipelines") processing pipeline]
| component. Factories are useful for creating stateful components, especially ones which depend on shared data.
+aside-code("Example").

View File

@ -1,10 +1,8 @@
//- 💫 DOCS > API > UTIL
include ../../_includes/_mixins
//- 💫 DOCS > API > TOP-LEVEL > UTIL
p
| spaCy comes with a small collection of utility functions located in
| #[+src(gh("spaCy", "spacy/util.py")) spacy/util.py].
| #[+src(gh("spaCy", "spacy/util.py")) #[code spacy/util.py]].
| Because utility functions are mostly intended for
| #[strong internal use within spaCy], their behaviour may change with
| future releases. The functions documented on this page should be safe
@ -12,7 +10,7 @@ p
| recommend having additional tests in place if your application depends on
| any of spaCy's utilities.
+h(2, "get_data_path") util.get_data_path
+h(3, "util.get_data_path") util.get_data_path
+tag function
p
@ -25,12 +23,12 @@ p
+cell bool
+cell Only return path if it exists, otherwise return #[code None].
+footrow
+row("foot")
+cell returns
+cell #[code Path] / #[code None]
+cell Data path or #[code None].
+h(2, "set_data_path") util.set_data_path
+h(3, "util.set_data_path") util.set_data_path
+tag function
p
@ -47,12 +45,12 @@ p
+cell unicode or #[code Path]
+cell Path to new data directory.
+h(2, "get_lang_class") util.get_lang_class
+h(3, "util.get_lang_class") util.get_lang_class
+tag function
p
| Import and load a #[code Language] class. Allows lazy-loading
| #[+a("/docs/usage/adding-languages") language data] and importing
| #[+a("/usage/adding-languages") language data] and importing
| languages using the two-letter language code.
+aside-code("Example").
@ -67,12 +65,12 @@ p
+cell unicode
+cell Two-letter language code, e.g. #[code 'en'].
+footrow
+row("foot")
+cell returns
+cell #[code Language]
+cell Language class.
+h(2, "load_model") util.load_model
+h(3, "util.load_model") util.load_model
+tag function
+tag-new(2)
@ -101,12 +99,12 @@ p
+cell -
+cell Specific overrides, like pipeline components to disable.
+footrow
+row("foot")
+cell returns
+cell #[code Language]
+cell #[code Language] class with the loaded model.
+h(2, "load_model_from_path") util.load_model_from_path
+h(3, "util.load_model_from_path") util.load_model_from_path
+tag function
+tag-new(2)
@ -139,18 +137,18 @@ p
+cell -
+cell Specific overrides, like pipeline components to disable.
+footrow
+row("foot")
+cell returns
+cell #[code Language]
+cell #[code Language] class with the loaded model.
+h(2, "load_model_from_init_py") util.load_model_from_init_py
+h(3, "util.load_model_from_init_py") util.load_model_from_init_py
+tag function
+tag-new(2)
p
| A helper function to use in the #[code load()] method of a model package's
| #[+src(gh("spacy-dev-resources", "templates/model/en_model_name/__init__.py")) __init__.py].
| #[+src(gh("spacy-dev-resources", "templates/model/en_model_name/__init__.py")) #[code __init__.py]].
+aside-code("Example").
from spacy.util import load_model_from_init_py
@ -169,12 +167,12 @@ p
+cell -
+cell Specific overrides, like pipeline components to disable.
+footrow
+row("foot")
+cell returns
+cell #[code Language]
+cell #[code Language] class with the loaded model.
+h(2, "get_model_meta") util.get_model_meta
+h(3, "util.get_model_meta") util.get_model_meta
+tag function
+tag-new(2)
@ -190,17 +188,17 @@ p
+cell unicode or #[code Path]
+cell Path to model directory.
+footrow
+row("foot")
+cell returns
+cell dict
+cell The model's meta data.
+h(2, "is_package") util.is_package
+h(3, "util.is_package") util.is_package
+tag function
p
| Check if string maps to a package installed via pip. Mainly used to
| validate #[+a("/docs/usage/models") model packages].
| validate #[+a("/usage/models") model packages].
+aside-code("Example").
util.is_package('en_core_web_sm') # True
@ -212,18 +210,18 @@ p
+cell unicode
+cell Name of package.
+footrow
+row("foot")
+cell returns
+cell #[code bool]
+cell #[code True] if installed package, #[code False] if not.
+h(2, "get_package_path") util.get_package_path
+h(3, "util.get_package_path") util.get_package_path
+tag function
+tag-new(2)
p
| Get path to an installed package. Mainly used to resolve the location of
| #[+a("/docs/usage/models") model packages]. Currently imports the package
| #[+a("/usage/models") model packages]. Currently imports the package
| to find its path.
+aside-code("Example").
@ -236,12 +234,12 @@ p
+cell unicode
+cell Name of installed package.
+footrow
+row("foot")
+cell returns
+cell #[code Path]
+cell Path to model package directory.
+h(2, "is_in_jupyter") util.is_in_jupyter
+h(3, "util.is_in_jupyter") util.is_in_jupyter
+tag function
+tag-new(2)
@ -257,17 +255,17 @@ p
return display(HTML(html))
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell bool
+cell #[code True] if in Jupyter, #[code False] if not.
+h(2, "update_exc") util.update_exc
+h(3, "util.update_exc") util.update_exc
+tag function
p
| Update, validate and overwrite
| #[+a("/docs/usage/adding-languages#tokenizer-exceptions") tokenizer exceptions].
| #[+a("/usage/adding-languages#tokenizer-exceptions") tokenizer exceptions].
| Used to combine global exceptions with custom, language-specific
| exceptions. Will raise an error if key doesn't match #[code ORTH] values.
@ -288,20 +286,20 @@ p
+cell dicts
+cell Exception dictionaries to add to the base exceptions, in order.
+footrow
+row("foot")
+cell returns
+cell dict
+cell Combined tokenizer exceptions.
+h(2, "prints") util.prints
+h(3, "util.prints") util.prints
+tag function
+tag-new(2)
p
| Print a formatted, text-wrapped message with optional title. If a text
| argument is a #[code Path], it's converted to a string. Should only
| be used for interactive components like the #[+api("cli") cli].
| be used for interactive components like the command-line interface.
+aside-code("Example").
data_path = Path('/some/path')

131
website/api/annotation.jade Normal file
View File

@ -0,0 +1,131 @@
//- 💫 DOCS > API > ANNOTATION SPECS
include ../_includes/_mixins
p This document describes the target annotations spaCy is trained to predict.
+section("tokenization")
+h(2, "tokenization") Tokenization
p
| Tokenization standards are based on the
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
| The tokenizer differs from most by including tokens for significant
| whitespace. Any sequence of whitespace characters beyond a single space
| (#[code ' ']) is included as a token.
+aside-code("Example").
from spacy.lang.en import English
nlp = English()
tokens = nlp('Some\nspaces and\ttab characters')
tokens_text = [t.text for t in tokens]
assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and',
'\t', 'tab', 'characters']
p
| The whitespace tokens are useful for much the same reason punctuation is:
| it's often an important delimiter in the text. By preserving it in the
| token output, we are able to maintain a simple alignment between the
| tokens and the original string, and we ensure that no information is
| lost during processing.
+section("sbd")
+h(2, "sentence-boundary") Sentence boundary detection
p
| Sentence boundaries are calculated from the syntactic parse tree, so
| features such as punctuation and capitalisation play an important but
| non-decisive role in determining the sentence boundaries. Usually this
| means that the sentence boundaries will at least coincide with clause
| boundaries, even given poorly punctuated text.
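p
| For example, the predicted boundaries can be inspected via the
| #[code Doc.sents] property (a short sketch, assuming an English model
| with a parser is installed under the shortcut link #[code 'en']):
+code.
import spacy
nlp = spacy.load('en')
doc = nlp(u"This is a sentence. This is another one.")
# sentence boundaries are derived from the dependency parse
print([sent.text for sent in doc.sents])
# expected: ['This is a sentence.', 'This is another one.']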
+section("pos-tagging")
+h(2, "pos-tagging") Part-of-speech Tagging
+aside("Tip: Understanding tags")
| You can also use #[code spacy.explain()] to get the description for the
| string representation of a tag. For example,
| #[code spacy.explain("RB")] will return "adverb".
include _annotation/_pos-tags
+section("lemmatization")
+h(2, "lemmatization") Lemmatization
p A "lemma" is the uninflected form of a word. In English, this means:
+list
+item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
+item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
+item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
+item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
p
| The lemmatization data is taken from
| #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
| special case for pronouns: all pronouns are lemmatized to the special
| token #[code -PRON-].
+infobox("About spaCy's custom pronoun lemma")
| Unlike verbs and common nouns, there's no clear base form of a personal
| pronoun. Should the lemma of "me" be "I", or should we normalize person
| as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
| novel symbol, #[code -PRON-], which is used as the lemma for
| all personal pronouns.
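p
| For example, with an English model loaded (assumed here under the
| shortcut link #[code 'en']), the pronoun lemma shows up as follows:
+code.
import spacy
nlp = spacy.load('en')
doc = nlp(u"I was reading the paper")
print([token.lemma_ for token in doc])
# roughly: ['-PRON-', 'be', 'read', 'the', 'paper']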
+section("dependency-parsing")
+h(2, "dependency-parsing") Syntactic Dependency Parsing
+aside("Tip: Understanding labels")
| You can also use #[code spacy.explain()] to get the description for the
| string representation of a label. For example,
| #[code spacy.explain("prt")] will return "particle".
include _annotation/_dep-labels
+section("named-entities")
+h(2, "named-entities") Named Entity Recognition
+aside("Tip: Understanding entity types")
| You can also use #[code spacy.explain()] to get the description for the
| string representation of an entity label. For example,
| #[code spacy.explain("LANGUAGE")] will return "any named language".
include _annotation/_named-entities
+h(3, "biluo") BILUO Scheme
include _annotation/_biluo
+section("training")
+h(2, "json-input") JSON input format for training
+under-construction
p spaCy takes training data in the following format:
+code("Example structure").
doc: {
id: string,
paragraphs: [{
raw: string,
sents: [int],
tokens: [{
start: int,
tag: string,
head: int,
dep: string
}],
ner: [{
start: int,
end: int,
label: string
}],
brackets: [{
start: int,
end: int,
label: string
}]
}]
}

View File

@ -1,6 +1,6 @@
//- 💫 DOCS > API > BINDER
include ../../_includes/_mixins
include ../_includes/_mixins
p A container class for serializing collections of #[code Doc] objects.

View File

@ -0,0 +1,5 @@
//- 💫 DOCS > API > DEPENDENCYPARSER
include ../_includes/_mixins
!=partial("pipe", { subclass: "DependencyParser", short: "parser", pipeline_id: "parser" })

View File

@ -1,8 +1,6 @@
//- 💫 DOCS > API > DOC
include ../../_includes/_mixins
p A container for accessing linguistic annotations.
include ../_includes/_mixins
p
| A #[code Doc] is a sequence of #[+api("token") #[code Token]] objects.
@ -47,7 +45,7 @@ p
| subsequent space. Must have the same length as #[code words], if
| specified. Defaults to a sequence of #[code True].
+footrow
+row("foot")
+cell returns
+cell #[code Doc]
+cell The newly constructed object.
@ -73,7 +71,7 @@ p
+cell int
+cell The index of the token.
+footrow
+row("foot")
+cell returns
+cell #[code Token]
+cell The token at #[code doc[i]].
@ -96,7 +94,7 @@ p
+cell tuple
+cell The slice of the document to get.
+footrow
+row("foot")
+cell returns
+cell #[code Span]
+cell The span at #[code doc[start : end]].
@ -120,7 +118,7 @@ p
| from Cython.
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell A #[code Token] object.
@ -135,7 +133,7 @@ p Get the number of tokens in the document.
assert len(doc) == 7
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell int
+cell The number of tokens in the document.
@ -172,7 +170,7 @@ p Create a #[code Span] object from the slice #[code doc.text[start : end]].
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell A meaning representation of the span.
+footrow
+row("foot")
+cell returns
+cell #[code Span]
+cell The newly constructed object.
@ -200,7 +198,7 @@ p
| The object to compare with. By default, accepts #[code Doc],
| #[code Span], #[code Token] and #[code Lexeme] objects.
+footrow
+row("foot")
+cell returns
+cell float
+cell A scalar similarity score. Higher is more similar.
@ -226,7 +224,7 @@ p
+cell int
+cell The attribute ID
+footrow
+row("foot")
+cell returns
+cell dict
+cell A dictionary mapping attributes to integer counts.
@ -251,7 +249,7 @@ p
+cell list
+cell A list of attribute ID ints.
+footrow
+row("foot")
+cell returns
+cell #[code.u-break numpy.ndarray[ndim=2, dtype='int32']]
+cell
@ -285,7 +283,7 @@ p
+cell #[code.u-break numpy.ndarray[ndim=2, dtype='int32']]
+cell The attribute values to load.
+footrow
+row("foot")
+cell returns
+cell #[code Doc]
+cell Itself.
@ -326,7 +324,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
| A path to a directory. Paths may be either strings or
| #[code Path]-like objects.
+footrow
+row("foot")
+cell returns
+cell #[code Doc]
+cell The modified #[code Doc] object.
@ -341,7 +339,7 @@ p Serialize, i.e. export the document contents to a binary string.
doc_bytes = doc.to_bytes()
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell bytes
+cell
@ -367,7 +365,7 @@ p Deserialize, i.e. import the document contents from a binary string.
+cell bytes
+cell The string to load from.
+footrow
+row("foot")
+cell returns
+cell #[code Doc]
+cell The #[code Doc] object.
@ -378,7 +376,7 @@ p Deserialize, i.e. import the document contents from a binary string.
p
| Retokenize the document, such that the span at
| #[code doc.text[start_idx : end_idx]] is merged into a single token. If
| #[code start_idx] and #[end_idx] do not mark start and end token
| #[code start_idx] and #[code end_idx] do not mark start and end token
| boundaries, the document remains unchanged.
+aside-code("Example").
@ -405,7 +403,7 @@ p
| attributes are inherited from the syntactic root token of
| the span.
+footrow
+row("foot")
+cell returns
+cell #[code Token]
+cell
@ -440,7 +438,7 @@ p
+cell bool
+cell Don't include arcs or modifiers.
+footrow
+row("foot")
+cell returns
+cell dict
+cell Parse tree as dict.
@ -462,7 +460,7 @@ p
assert ents[0].text == 'Mr. Best'
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Span]
+cell Entities in the document.
@ -485,7 +483,7 @@ p
assert chunks[1].text == "another phrase"
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Span]
+cell Noun chunks in the document.
@ -507,7 +505,7 @@ p
assert [s.root.text for s in sents] == ["is", "'s"]
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Span]
+cell Sentences in the document.
@ -525,7 +523,7 @@ p
assert doc.has_vector
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether the document has a vector data attached.
@ -544,7 +542,7 @@ p
assert doc.vector.shape == (300,)
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell A 1D numpy array representing the document's semantics.
@ -564,7 +562,7 @@ p
assert doc1.vector_norm != doc2.vector_norm
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell float
+cell The L2 norm of the vector representation.

View File

@ -0,0 +1,5 @@
//- 💫 DOCS > API > ENTITYRECOGNIZER
include ../_includes/_mixins
!=partial("pipe", { subclass: "EntityRecognizer", short: "ner", pipeline_id: "ner" })

View File

@ -1,14 +1,12 @@
//- 💫 DOCS > API > GOLDCORPUS
include ../../_includes/_mixins
include ../_includes/_mixins
p
| An annotated corpus, using the JSON file format. Manages annotations for
| tagging, dependency parsing and NER.
| This class manages annotations for tagging, dependency parsing and NER.
+h(2, "init") GoldCorpus.__init__
+tag method
+tag-new(2)
p Create a #[code GoldCorpus].

View File

@ -1,6 +1,6 @@
//- 💫 DOCS > API > GOLDPARSE
include ../../_includes/_mixins
include ../_includes/_mixins
p Collection for training annotations.
@ -40,7 +40,7 @@ p Create a #[code GoldParse].
+cell iterable
+cell A sequence of named entity annotations, either as BILUO tag strings, or as #[code (start_char, end_char, label)] tuples, representing the entity positions.
+footrow
+row("foot")
+cell returns
+cell #[code GoldParse]
+cell The newly constructed object.
@ -51,7 +51,7 @@ p Create a #[code GoldParse].
p Get the number of gold-standard tokens.
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell int
+cell The number of gold-standard tokens.
@ -64,7 +64,7 @@ p
| tree.
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether annotations form projective tree.
@ -119,7 +119,7 @@ p
p
| Encode labelled spans into per-token tags, using the
| #[+a("/docs/api/annotation#biluo") BILUO scheme] (Begin/In/Last/Unit/Out).
| #[+a("/api/annotation#biluo") BILUO scheme] (Begin/In/Last/Unit/Out).
p
| Returns a list of unicode strings, describing the tags. Each tag string
@ -157,11 +157,11 @@ p
| and #[code end] should be character-offset integers denoting the
| slice into the original string.
+footrow
+row("foot")
+cell returns
+cell list
+cell
| Unicode strings, describing the
| #[+a("/docs/api/annotation#biluo") BILUO] tags.
| #[+a("/api/annotation#biluo") BILUO] tags.

14
website/api/index.jade Normal file
View File

@ -0,0 +1,14 @@
//- 💫 DOCS > API > ARCHITECTURE
include ../_includes/_mixins
+section("basics")
include ../usage/_spacy-101/_architecture
+section("nn-model")
+h(2, "nn-model") Neural network model architecture
include _architecture/_nn-model
+section("cython")
+h(2, "cython") Cython conventions
include _architecture/_cython

View File

@ -1,10 +1,10 @@
//- 💫 DOCS > API > LANGUAGE
include ../../_includes/_mixins
include ../_includes/_mixins
p
| A text-processing pipeline. Usually you'll load this once per process,
| and pass the instance around your application.
| Usually you'll load this once per process as #[code nlp] and pass the
| instance around your application.
+h(2, "init") Language.__init__
+tag method
@ -49,7 +49,7 @@ p Initialise a #[code Language] object.
| Custom meta data for the #[code Language] class. Is written to by
| models to add model meta data.
+footrow
+row("foot")
+cell returns
+cell #[code Language]
+cell The newly constructed object.
@ -77,14 +77,14 @@ p
+cell list
+cell
| Names of pipeline components to
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
| #[+a("/usage/processing-pipelines#disabling") disable].
+footrow
+row("foot")
+cell returns
+cell #[code Doc]
+cell A container for accessing the annotations.
+infobox("⚠️ Deprecation note")
+infobox("Deprecation note", "⚠️")
.o-block
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
@ -136,9 +136,9 @@ p
+cell list
+cell
| Names of pipeline components to
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
| #[+a("/usage/processing-pipelines#disabling") disable].
+footrow
+row("foot")
+cell yields
+cell #[code Doc]
+cell Documents in the order of the original text.
@ -175,7 +175,7 @@ p Update the models in the pipeline.
+cell callable
+cell An optimizer.
+footrow
+row("foot")
+cell returns
+cell dict
+cell Results from the update.
@ -200,7 +200,7 @@ p
+cell -
+cell Config parameters.
+footrow
+row("foot")
+cell yields
+cell tuple
+cell An optimizer.
@ -242,7 +242,7 @@ p
+cell iterable
+cell Tuples of #[code Doc] and #[code GoldParse] objects.
+footrow
+row("foot")
+cell yields
+cell tuple
+cell Tuples of #[code Doc] and #[code GoldParse] objects.
@ -271,7 +271,7 @@ p
+cell list
+cell
| Names of pipeline components to
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable]
| #[+a("/usage/processing-pipelines#disabling") disable]
| and prevent from being saved.
+h(2, "from_disk") Language.from_disk
@ -300,14 +300,14 @@ p
+cell list
+cell
| Names of pipeline components to
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
| #[+a("/usage/processing-pipelines#disabling") disable].
+footrow
+row("foot")
+cell returns
+cell #[code Language]
+cell The modified #[code Language] object.
+infobox("⚠️ Deprecation note")
+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy v2.0, the #[code save_to_directory] method has been
| renamed to #[code to_disk], to improve consistency across classes.
@ -332,10 +332,10 @@ p Serialize the current state to a binary string.
+cell list
+cell
| Names of pipeline components to
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable]
| #[+a("/usage/processing-pipelines#disabling") disable]
| and prevent from being serialized.
+footrow
+row("foot")
+cell returns
+cell bytes
+cell The serialized form of the #[code Language] object.
@ -362,14 +362,14 @@ p Load state from a binary string.
+cell list
+cell
| Names of pipeline components to
| #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
| #[+a("/usage/processing-pipelines#disabling") disable].
+footrow
+row("foot")
+cell returns
+cell #[code Language]
+cell The #[code Language] object.
+infobox("⚠️ Deprecation note")
+infobox("Deprecation note", "⚠️")
.o-block
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument

View File

@ -0,0 +1,5 @@
//- 💫 DOCS > API > LEMMATIZER
include ../_includes/_mixins
+under-construction

View File

@ -1,6 +1,6 @@
//- 💫 DOCS > API > LEXEME
include ../../_includes/_mixins
include ../_includes/_mixins
p
| An entry in the vocabulary. A #[code Lexeme] has no string context it's
@ -24,7 +24,7 @@ p Create a #[code Lexeme] object.
+cell int
+cell The orth id of the lexeme.
+footrow
+row("foot")
+cell returns
+cell #[code Lexeme]
+cell The newly constructed object.
@ -65,7 +65,7 @@ p Check the value of a boolean flag.
+cell int
+cell The attribute ID of the flag to query.
+footrow
+row("foot")
+cell returns
+cell bool
+cell The value of the flag.
@ -91,7 +91,7 @@ p Compute a semantic similarity estimate. Defaults to cosine over vectors.
| The object to compare with. By default, accepts #[code Doc],
| #[code Span], #[code Token] and #[code Lexeme] objects.
+footrow
+row("foot")
+cell returns
+cell float
+cell A scalar similarity score. Higher is more similar.
@ -110,7 +110,7 @@ p
assert apple.has_vector
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether the lexeme has a vector data attached.
@ -127,7 +127,7 @@ p A real-valued meaning representation.
assert apple.vector.shape == (300,)
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell A 1D numpy array representing the lexeme's semantics.
@ -146,7 +146,7 @@ p The L2 norm of the lexeme's vector representation.
assert apple.vector_norm != pasta.vector_norm
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell float
+cell The L2 norm of the vector representation.

View File

@ -1,10 +1,8 @@
//- 💫 DOCS > API > MATCHER
include ../../_includes/_mixins
include ../_includes/_mixins
p Match sequences of tokens, based on pattern rules.
+infobox("⚠️ Deprecation note")
+infobox("Deprecation note", "⚠️")
| As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
| are deprecated and have been replaced with a simpler
| #[+api("matcher#add") #[code Matcher.add]] that lets you add a list of
@ -39,7 +37,7 @@ p Create the rule-based #[code Matcher].
+cell dict
+cell Patterns to add to the matcher, keyed by ID.
+footrow
+row("foot")
+cell returns
+cell #[code Matcher]
+cell The newly constructed object.
@ -64,7 +62,7 @@ p Find all token sequences matching the supplied patterns on the #[code Doc].
+cell #[code Doc]
+cell The document to match over.
+footrow
+row("foot")
+cell returns
+cell list
+cell
@ -81,7 +79,7 @@ p Find all token sequences matching the supplied patterns on the #[code Doc].
| actions per pattern within the same matcher. For example, you might only
| want to merge some entity types, and set custom flags for other matched
| patterns. For more details and examples, see the usage guide on
| #[+a("/docs/usage/rule-based-matching") rule-based matching].
| #[+a("/usage/linguistic-features#rule-based-matching") rule-based matching].
+h(2, "pipe") Matcher.pipe
+tag method
@ -113,7 +111,7 @@ p Match a stream of documents, yielding them in turn.
| parallel, if the #[code Matcher] implementation supports
| multi-threading.
+footrow
+row("foot")
+cell yields
+cell #[code Doc]
+cell Documents, in order.
@ -134,7 +132,7 @@ p
assert len(matcher) == 1
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell int
+cell The number of rules.
@ -156,7 +154,8 @@ p Check whether the matcher contains rules for a match ID.
+cell #[code key]
+cell unicode
+cell The match ID.
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether the matcher contains rules for this match ID.
@ -203,7 +202,7 @@ p
| Match pattern. A pattern consists of a list of dicts, where each
| dict describes a token.
+infobox("⚠️ Deprecation note")
+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
| are deprecated and have been replaced with a simpler
@ -257,7 +256,7 @@ p
+cell unicode
+cell The ID of the match rule.
+footrow
+row("foot")
+cell returns
+cell tuple
+cell The rule, as an #[code (on_match, patterns)] tuple.

View File

@ -0,0 +1,181 @@
//- 💫 DOCS > API > PHRASEMATCHER
include ../_includes/_mixins
p
| The #[code PhraseMatcher] lets you efficiently match large terminology
| lists. While the #[+api("matcher") #[code Matcher]] lets you match
| sequences based on lists of token descriptions, the #[code PhraseMatcher]
| accepts match patterns in the form of #[code Doc] objects.
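p
| For example, a terminology list can be turned into match patterns by
| processing each phrase with #[code nlp]. The sketch below assumes a
| loaded pipeline; the terms are only placeholders.
+aside-code("Example").
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
# placeholder terminology list
terms = [u'Barack Obama', u'Angela Merkel']
patterns = [nlp(term) for term in terms]
matcher.add('TERMS', None, *patterns)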
+h(2, "init") PhraseMatcher.__init__
+tag method
p Create the rule-based #[code PhraseMatcher].
+aside-code("Example").
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, max_length=6)
+table(["Name", "Type", "Description"])
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell
| The vocabulary object, which must be shared with the documents
| the matcher will operate on.
+row
+cell #[code max_length]
+cell int
+cell Maximum length of a phrase pattern to add.
+row("foot")
+cell returns
+cell #[code PhraseMatcher]
+cell The newly constructed object.
+h(2, "call") PhraseMatcher.__call__
+tag method
p Find all token sequences matching the supplied patterns on the #[code Doc].
+aside-code("Example").
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
matcher.add('OBAMA', None, nlp(u"Barack Obama"))
doc = nlp(u"Barack Obama lifts America one last time in emotional farewell")
matches = matcher(doc)
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The document to match over.
+row("foot")
+cell returns
+cell list
+cell
| A list of #[code (match_id, start, end)] tuples, describing the
| matches. A match tuple describes a span #[code doc[start:end]].
| The #[code match_id] is the ID of the added match pattern.
+h(2, "pipe") PhraseMatcher.pipe
+tag method
p Match a stream of documents, yielding them in turn.
+aside-code("Example").
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
for doc in matcher.pipe(texts, batch_size=50, n_threads=4):
pass
+table(["Name", "Type", "Description"])
+row
+cell #[code docs]
+cell iterable
+cell A stream of documents.
+row
+cell #[code batch_size]
+cell int
+cell The number of documents to accumulate into a working set.
+row
+cell #[code n_threads]
+cell int
+cell
| The number of threads with which to work on the buffer in
| parallel, if the #[code PhraseMatcher] implementation supports
| multi-threading.
+row("foot")
+cell yields
+cell #[code Doc]
+cell Documents, in order.
+h(2, "len") PhraseMatcher.__len__
+tag method
p
| Get the number of rules added to the matcher. Note that this only returns
| the number of rules (identical with the number of IDs), not the number
| of individual patterns.
+aside-code("Example").
matcher = PhraseMatcher(nlp.vocab)
assert len(matcher) == 0
matcher.add('OBAMA', None, nlp(u"Barack Obama"))
assert len(matcher) == 1
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell int
+cell The number of rules.
+h(2, "contains") PhraseMatcher.__contains__
+tag method
p Check whether the matcher contains rules for a match ID.
+aside-code("Example").
matcher = PhraseMatcher(nlp.vocab)
assert 'OBAMA' not in matcher
matcher.add('OBAMA', None, nlp(u"Barack Obama"))
assert 'OBAMA' in matcher
+table(["Name", "Type", "Description"])
+row
+cell #[code key]
+cell unicode
+cell The match ID.
+row("foot")
+cell returns
+cell bool
+cell Whether the matcher contains rules for this match ID.
+h(2, "add") PhraseMatcher.add
+tag method
p
| Add a rule to the matcher, consisting of an ID key, one or more patterns, and
| a callback function to act on the matches. The callback function will
| receive the arguments #[code matcher], #[code doc], #[code i] and
| #[code matches]. If a pattern already exists for the given ID, the
| patterns will be extended. An #[code on_match] callback will be
| overwritten.
+aside-code("Example").
def on_match(matcher, doc, id, matches):
print('Matched!', matches)
matcher = PhraseMatcher(nlp.vocab)
matcher.add('OBAMA', on_match, nlp(u"Barack Obama"))
matcher.add('HEALTH', on_match, nlp(u"health care reform"),
nlp(u"healthcare reform"))
doc = nlp(u"Barack Obama urges Congress to find courage to defend his healthcare reforms")
matches = matcher(doc)
+table(["Name", "Type", "Description"])
+row
+cell #[code match_id]
+cell unicode
+cell An ID for the thing you're matching.
+row
+cell #[code on_match]
+cell callable or #[code None]
+cell
| Callback function to act on matches. Takes the arguments
| #[code matcher], #[code doc], #[code i] and #[code matches].
+row
+cell #[code *docs]
+cell list
+cell
| #[code Doc] objects of the phrases to match.

390
website/api/pipe.jade Normal file
View File

@ -0,0 +1,390 @@
//- 💫 DOCS > API > PIPE
include ../_includes/_mixins
//- This page can be used as a template for all other classes that inherit
//- from `Pipe`.
if subclass
+infobox
| This class is a subclass of #[+api("pipe") #[code Pipe]] and
| follows the same API. The pipeline component is available in the
| #[+a("/usage/processing-pipelines") processing pipeline] via the ID
| #[code "#{pipeline_id}"].
else
p
| This class is not instantiated directly. Components inherit from it,
| and it defines the interface that components should follow to
| function as components in a spaCy analysis pipeline.
- CLASSNAME = subclass || 'Pipe'
- VARNAME = short || CLASSNAME.toLowerCase()
+h(2, "model") #{CLASSNAME}.Model
+tag classmethod
p
| Initialise a model for the pipe. The model should implement the
| #[code thinc.neural.Model] API. Wrappers are available for
| #[+a("/usage/deep-learning") most major machine learning libraries].
+table(["Name", "Type", "Description"])
+row
+cell #[code **kwargs]
+cell -
+cell Parameters for initialising the model
+row("foot")
+cell returns
+cell object
+cell The initialised model.
+h(2, "init") #{CLASSNAME}.__init__
+tag method
p Create a new pipeline instance.
+aside-code("Example").
from spacy.pipeline import #{CLASSNAME}
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
+table(["Name", "Type", "Description"])
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell The shared vocabulary.
+row
+cell #[code model]
+cell #[code thinc.neural.Model] or #[code True]
+cell
| The model powering the pipeline component. If no model is
| supplied, the model is created when you call
| #[code begin_training], #[code from_disk] or #[code from_bytes].
+row
+cell #[code **cfg]
+cell -
+cell Configuration parameters.
+row("foot")
+cell returns
+cell #[code=CLASSNAME]
+cell The newly constructed object.
+h(2, "call") #{CLASSNAME}.__call__
+tag method
p
| Apply the pipe to one document. The document is modified in place, and
| returned. Both #[code #{CLASSNAME}.__call__] and
| #[code #{CLASSNAME}.pipe] should delegate to the
| #[code #{CLASSNAME}.predict] and #[code #{CLASSNAME}.set_annotations]
| methods.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
doc = nlp(u"This is a sentence.")
processed = #{VARNAME}(doc)
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The document to process.
+row("foot")
+cell returns
+cell #[code Doc]
+cell The processed document.
+h(2, "pipe") #{CLASSNAME}.pipe
+tag method
p
| Apply the pipe to a stream of documents. Both
| #[code #{CLASSNAME}.__call__] and #[code #{CLASSNAME}.pipe] should
| delegate to the #[code #{CLASSNAME}.predict] and
| #[code #{CLASSNAME}.set_annotations] methods.
+aside-code("Example").
texts = [u'One doc', u'...', u'Lots of docs']
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
for doc in #{VARNAME}.pipe(texts, batch_size=50):
pass
+table(["Name", "Type", "Description"])
+row
+cell #[code stream]
+cell iterable
+cell A stream of documents.
+row
+cell #[code batch_size]
+cell int
+cell The number of texts to buffer. Defaults to #[code 128].
+row
+cell #[code n_threads]
+cell int
+cell
| The number of worker threads to use. If #[code -1], OpenMP will
| decide how many to use at run time. Default is #[code -1].
+row("foot")
+cell yields
+cell #[code Doc]
+cell Processed documents in the order of the original text.
+h(2, "predict") #{CLASSNAME}.predict
+tag method
p
| Apply the pipeline's model to a batch of docs, without modifying them.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
scores = #{VARNAME}.predict([doc1, doc2])
+table(["Name", "Type", "Description"])
+row
+cell #[code docs]
+cell iterable
+cell The documents to predict.
+row("foot")
+cell returns
+cell -
+cell Scores from the model.
+h(2, "set_annotations") #{CLASSNAME}.set_annotations
+tag method
p
| Modify a batch of documents, using pre-computed scores.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
scores = #{VARNAME}.predict([doc1, doc2])
#{VARNAME}.set_annotations([doc1, doc2], scores)
+table(["Name", "Type", "Description"])
+row
+cell #[code docs]
+cell iterable
+cell The documents to modify.
+row
+cell #[code scores]
+cell -
+cell The scores to set, produced by #[code #{CLASSNAME}.predict].
+h(2, "update") #{CLASSNAME}.update
+tag method
p
| Learn from a batch of documents and gold-standard information, updating
| the pipe's model. Delegates to #[code #{CLASSNAME}.predict] and
| #[code #{CLASSNAME}.get_loss].
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
losses = {}
optimizer = nlp.begin_training()
#{VARNAME}.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+table(["Name", "Type", "Description"])
+row
+cell #[code docs]
+cell iterable
+cell A batch of documents to learn from.
+row
+cell #[code golds]
+cell iterable
+cell The gold-standard data. Must have the same length as #[code docs].
+row
+cell #[code drop]
+cell float
+cell The dropout rate.
+row
+cell #[code sgd]
+cell callable
+cell
| The optimizer. Should take two arguments #[code weights] and
| #[code gradient], and an optional ID.
+row
+cell #[code losses]
+cell dict
+cell
| Optional record of the loss during training. The value keyed by
| the model's name is updated.
+h(2, "get_loss") #{CLASSNAME}.get_loss
+tag method
p
| Find the loss and gradient of loss for the batch of documents and their
| predicted scores.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
scores = #{VARNAME}.predict([doc1, doc2])
loss, d_loss = #{VARNAME}.get_loss([doc1, doc2], [gold1, gold2], scores)
+table(["Name", "Type", "Description"])
+row
+cell #[code docs]
+cell iterable
+cell The batch of documents.
+row
+cell #[code golds]
+cell iterable
+cell The gold-standard data. Must have the same length as #[code docs].
+row
+cell #[code scores]
+cell -
+cell Scores representing the model's predictions.
+row("foot")
+cell returns
+cell tuple
+cell The loss and the gradient, i.e. #[code (loss, gradient)].
+h(2, "begin_training") #{CLASSNAME}.begin_training
+tag method
p
| Initialize the pipe for training, using data examples if available. If no
| model has been initialized yet, the model is added.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
nlp.pipeline.append(#{VARNAME})
#{VARNAME}.begin_training(pipeline=nlp.pipeline)
+table(["Name", "Type", "Description"])
+row
+cell #[code gold_tuples]
+cell iterable
+cell
| Optional gold-standard annotations from which to construct
| #[+api("goldparse") #[code GoldParse]] objects.
+row
+cell #[code pipeline]
+cell list
+cell
| Optional list of #[+api("pipe") #[code Pipe]] components that
| this component is part of.
+h(2, "use_params") #{CLASSNAME}.use_params
+tag method
+tag contextmanager
p Modify the pipe's model, to use the given parameter values.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
optimizer = nlp.begin_training()
with #{VARNAME}.use_params(optimizer.averages):
#{VARNAME}.to_disk('/best_model')
+table(["Name", "Type", "Description"])
+row
+cell #[code params]
+cell -
+cell
| The parameter values to use in the model. At the end of the
| context, the original parameters are restored.
+h(2, "to_disk") #{CLASSNAME}.to_disk
+tag method
p Serialize the pipe to disk.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
#{VARNAME}.to_disk('/path/to/#{VARNAME}')
+table(["Name", "Type", "Description"])
+row
+cell #[code path]
+cell unicode or #[code Path]
+cell
| A path to a directory, which will be created if it doesn't exist.
| Paths may be either strings or #[code Path]-like objects.
+h(2, "from_disk") #{CLASSNAME}.from_disk
+tag method
p Load the pipe from disk. Modifies the object in place and returns it.
+aside-code("Example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
#{VARNAME}.from_disk('/path/to/#{VARNAME}')
+table(["Name", "Type", "Description"])
+row
+cell #[code path]
+cell unicode or #[code Path]
+cell
| A path to a directory. Paths may be either strings or
| #[code Path]-like objects.
+row("foot")
+cell returns
+cell #[code=CLASSNAME]
+cell The modified #[code=CLASSNAME] object.
+h(2, "to_bytes") #{CLASSNAME}.to_bytes
+tag method
+aside-code("example").
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
#{VARNAME}_bytes = #{VARNAME}.to_bytes()
p Serialize the pipe to a bytestring.
+table(["Name", "Type", "Description"])
+row
+cell #[code **exclude]
+cell -
+cell Named attributes to prevent from being serialized.
+row("foot")
+cell returns
+cell bytes
+cell The serialized form of the #[code=CLASSNAME] object.
+h(2, "from_bytes") #{CLASSNAME}.from_bytes
+tag method
p Load the pipe from a bytestring. Modifies the object in place and returns it.
+aside-code("Example").
#{VARNAME}_bytes = #{VARNAME}.to_bytes()
#{VARNAME} = #{CLASSNAME}(nlp.vocab)
#{VARNAME}.from_bytes(#{VARNAME}_bytes)
+table(["Name", "Type", "Description"])
+row
+cell #[code bytes_data]
+cell bytes
+cell The data to load from.
+row
+cell #[code **exclude]
+cell -
+cell Named attributes to prevent from being loaded.
+row("foot")
+cell returns
+cell #[code=CLASSNAME]
+cell The #[code=CLASSNAME] object.

View File

@ -1,6 +1,6 @@
//- 💫 DOCS > API > SPAN
include ../../_includes/_mixins
include ../_includes/_mixins
p A slice from a #[+api("doc") #[code Doc]] object.
@ -40,7 +40,7 @@ p Create a Span object from the #[code slice doc[start : end]].
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell A meaning representation of the span.
+footrow
+row("foot")
+cell returns
+cell #[code Span]
+cell The newly constructed object.
@ -61,7 +61,7 @@ p Get a #[code Token] object.
+cell int
+cell The index of the token within the span.
+footrow
+row("foot")
+cell returns
+cell #[code Token]
+cell The token at #[code span[i]].
@ -79,7 +79,7 @@ p Get a #[code Span] object.
+cell tuple
+cell The slice of the span to get.
+footrow
+row("foot")
+cell returns
+cell #[code Span]
+cell The span at #[code span[start : end]].
@ -95,7 +95,7 @@ p Iterate over #[code Token] objects.
assert [t.text for t in span] == ['it', 'back', '!']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell A #[code Token] object.
@ -111,7 +111,7 @@ p Get the number of tokens in the span.
assert len(span) == 3
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell int
+cell The number of tokens in the span.
@ -140,7 +140,7 @@ p
| The object to compare with. By default, accepts #[code Doc],
| #[code Span], #[code Token] and #[code Lexeme] objects.
+footrow
+row("foot")
+cell returns
+cell float
+cell A scalar similarity score. Higher is more similar.
@ -167,7 +167,7 @@ p
+cell list
+cell A list of attribute ID ints.
+footrow
+row("foot")
+cell returns
+cell #[code.u-break numpy.ndarray[long, ndim=2]]
+cell
@ -194,7 +194,7 @@ p Retokenize the document, such that the span is merged into a single token.
| Attributes to assign to the merged token. By default, attributes
| are inherited from the syntactic root token of the span.
+footrow
+row("foot")
+cell returns
+cell #[code Token]
+cell The newly merged token.
@ -216,7 +216,7 @@ p
assert new_york.root.text == 'York'
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell #[code Token]
+cell The root token.
@ -233,7 +233,7 @@ p Tokens that are to the left of the span, whose head is within the span.
assert lefts == [u'New']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell A left-child of a token of the span.
@ -250,7 +250,7 @@ p Tokens that are to the right of the span, whose head is within the span.
assert rights == [u'in']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell A right-child of a token of the span.
@ -267,7 +267,7 @@ p Tokens that descend from tokens in the span, but fall outside it.
assert subtree == [u'Give', u'it', u'back', u'!']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell A descendant of a token within the span.
@ -285,7 +285,7 @@ p
assert doc[1:].has_vector
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether the span has a vector data attached.
@ -304,7 +304,7 @@ p
assert doc[1:].vector.shape == (300,)
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell A 1D numpy array representing the span's semantics.
@ -323,7 +323,7 @@ p
assert doc[1:].vector_norm != doc[2:].vector_norm
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell float
+cell The L2 norm of the vector representation.

View File

@ -1,6 +1,6 @@
//- 💫 DOCS > API > STRINGSTORE
include ../../_includes/_mixins
include ../_includes/_mixins
p
| Look up strings by 64-bit hashes. As of v2.0, spaCy uses hash values
@ -23,7 +23,7 @@ p
+cell iterable
+cell A sequence of unicode strings to add to the store.
+footrow
+row("foot")
+cell returns
+cell #[code StringStore]
+cell The newly constructed object.
@ -38,7 +38,7 @@ p Get the number of strings in the store.
assert len(stringstore) == 2
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell int
+cell The number of strings in the store.
@ -60,7 +60,7 @@ p Retrieve a string from a given hash, or vice versa.
+cell bytes, unicode or uint64
+cell The value to encode.
+footrow
+row("foot")
+cell returns
+cell unicode or int
+cell The value to be retrieved.
@ -81,7 +81,7 @@ p Check whether a string is in the store.
+cell unicode
+cell The string to check.
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether the store contains the string.
@ -100,7 +100,7 @@ p
assert all_strings == [u'apple', u'orange']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell unicode
+cell A string in the store.
@ -125,7 +125,7 @@ p Add a string to the #[code StringStore].
+cell unicode
+cell The string to add.
+footrow
+row("foot")
+cell returns
+cell uint64
+cell The string's hash value.
@ -166,7 +166,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
| A path to a directory. Paths may be either strings or
| #[code Path]-like objects.
+footrow
+row("foot")
+cell returns
+cell #[code StringStore]
+cell The modified #[code StringStore] object.
@ -185,7 +185,7 @@ p Serialize the current state to a binary string.
+cell -
+cell Named attributes to prevent from being serialized.
+footrow
+row("foot")
+cell returns
+cell bytes
+cell The serialized form of the #[code StringStore] object.
@ -211,7 +211,7 @@ p Load state from a binary string.
+cell -
+cell Named attributes to prevent from being loaded.
+footrow
+row("foot")
+cell returns
+cell #[code StringStore]
+cell The #[code StringStore] object.
@ -233,7 +233,7 @@ p Get a 64-bit hash for a given string.
+cell unicode
+cell The string to hash.
+footrow
+row("foot")
+cell returns
+cell uint64
+cell The hash.

5
website/api/tagger.jade Normal file
View File

@ -0,0 +1,5 @@
//- 💫 DOCS > API > TAGGER
include ../_includes/_mixins
!=partial("pipe", { subclass: "Tagger", pipeline_id: "tagger" })

View File

@ -0,0 +1,5 @@
//- 💫 DOCS > API > TENSORIZER
include ../_includes/_mixins
!=partial("pipe", { subclass: "Tensorizer", pipeline_id: "tensorizer" })

View File

@ -0,0 +1,19 @@
//- 💫 DOCS > API > TEXTCATEGORIZER
include ../_includes/_mixins
p
| The model supports classification with multiple, non-mutually exclusive
| labels. You can change the model architecture rather easily, but by
| default, the #[code TextCategorizer] class uses a convolutional
| neural network to assign position-sensitive vectors to each word in the
| document. This step is similar to the #[+api("tensorizer") #[code Tensorizer]]
| component, but the #[code TextCategorizer] uses its own CNN model, to
| avoid sharing weights with the other pipeline components. The document
| tensor is then
| summarized by concatenating max and mean pooling, and a multilayer
| perceptron is used to predict an output vector of length #[code nr_class],
| before a logistic activation is applied elementwise. The value of each
| output neuron is the probability that some class is present.
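p
| A short sketch of adding the component to an existing pipeline; the
| label name is only an illustration.
+aside-code("Example").
from spacy.pipeline import TextCategorizer
textcat = TextCategorizer(nlp.vocab)
textcat.add_label('POSITIVE')  # illustrative label
nlp.pipeline.append(textcat)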
!=partial("pipe", { subclass: "TextCategorizer", short: "textcat", pipeline_id: "textcat" })

View File

@ -1,6 +1,6 @@
//- 💫 DOCS > API > TOKEN
include ../../_includes/_mixins
include ../_includes/_mixins
p An individual token — i.e. a word, punctuation symbol, whitespace, etc.
@ -30,7 +30,7 @@ p Construct a #[code Token] object.
+cell int
+cell The index of the token within the document.
+footrow
+row("foot")
+cell returns
+cell #[code Token]
+cell The newly constructed object.
@ -46,7 +46,7 @@ p The number of unicode characters in the token, i.e. #[code token.text].
assert len(token) == 4
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell int
+cell The number of unicode characters in the token.
@ -68,7 +68,7 @@ p Check the value of a boolean flag.
+cell int
+cell The attribute ID of the flag to check.
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether the flag is set.
@ -93,7 +93,7 @@ p Compute a semantic similarity estimate. Defaults to cosine over vectors.
| The object to compare with. By default, accepts #[code Doc],
| #[code Span], #[code Token] and #[code Lexeme] objects.
+footrow
+row("foot")
+cell returns
+cell float
+cell A scalar similarity score. Higher is more similar.
@ -114,7 +114,7 @@ p Get a neighboring token.
+cell int
+cell The relative position of the token to get. Defaults to #[code 1].
+footrow
+row("foot")
+cell returns
+cell #[code Token]
+cell The token at position #[code self.doc[self.i+i]].
@ -139,7 +139,7 @@ p
+cell #[code Token]
+cell Another token.
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether this token is the ancestor of the descendant.
@ -158,7 +158,7 @@ p The rightmost token of this token's syntactic descendants.
assert [t.text for t in he_ancestors] == [u'pleaded']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell
@ -177,7 +177,7 @@ p A sequence of coordinated tokens, including the token itself.
assert [t.text for t in apples_conjuncts] == [u'oranges']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell A coordinated token.
@ -194,7 +194,7 @@ p A sequence of the token's immediate syntactic children.
assert [t.text for t in give_children] == [u'it', u'back', u'!']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell A child token such that #[code child.head==self].
@ -211,7 +211,7 @@ p A sequence of all the token's syntactic descendents.
assert [t.text for t in give_subtree] == [u'Give', u'it', u'back', u'!']
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Token]
+cell A descendant token such that #[code self.is_ancestor(descendant)].
@ -230,7 +230,7 @@ p
assert apples.has_vector
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether the token has a vector data attached.
@ -248,7 +248,7 @@ p A real-valued meaning representation.
assert apples.vector.shape == (300,)
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell A 1D numpy array representing the token's semantics.
@ -268,7 +268,7 @@ p The L2 norm of the token's vector representation.
assert apples.vector_norm != pasta.vector_norm
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell float
+cell The L2 norm of the vector representation.
@ -280,20 +280,29 @@ p The L2 norm of the token's vector representation.
+cell #[code text]
+cell unicode
+cell Verbatim text content.
+row
+cell #[code text_with_ws]
+cell unicode
+cell Text content, with trailing space character if present.
+row
+cell #[code whitespace]
+cell int
+cell Trailing space character if present.
+row
+cell #[code whitespace_]
+cell unicode
+cell Trailing space character if present.
+row
+cell #[code orth]
+cell int
+cell ID of the verbatim text content.
+row
+cell #[code orth_]
+cell unicode
+cell
| Verbatim text content (identical to #[code Token.text]). Exists
| mostly for consistency with the other attributes.
+row
+cell #[code vocab]
+cell #[code Vocab]

View File

@ -1,6 +1,6 @@
//- 💫 DOCS > API > TOKENIZER
include ../../_includes/_mixins
include ../_includes/_mixins
p
| Segment text, and create #[code Doc] objects with the discovered segment
@ -57,7 +57,7 @@ p Create a #[code Tokenizer], to create #[code Doc] objects given unicode text.
+cell callable
+cell A boolean function matching strings to be recognised as tokens.
+footrow
+row("foot")
+cell returns
+cell #[code Tokenizer]
+cell The newly constructed object.
@ -77,7 +77,7 @@ p Tokenize a string.
+cell unicode
+cell The string to tokenize.
+footrow
+row("foot")
+cell returns
+cell #[code Doc]
+cell A container for linguistic annotations.
@ -110,7 +110,7 @@ p Tokenize a stream of texts.
| The number of threads to use, if the implementation supports
| multi-threading. The default tokenizer is single-threaded.
+footrow
+row("foot")
+cell yields
+cell #[code Doc]
+cell A sequence of Doc objects, in order.
@ -126,7 +126,7 @@ p Find internal split points of the string.
+cell unicode
+cell The string to split.
+footrow
+row("foot")
+cell returns
+cell list
+cell
@ -147,7 +147,7 @@ p
+cell unicode
+cell The string to segment.
+footrow
+row("foot")
+cell returns
+cell int
+cell The length of the prefix if present, otherwise #[code None].
@ -165,7 +165,7 @@ p
+cell unicode
+cell The string to segment.
+footrow
+row("foot")
+cell returns
+cell int / #[code None]
+cell The length of the suffix if present, otherwise #[code None].
@ -176,7 +176,7 @@ p
p
| Add a special-case tokenization rule. This mechanism is also used to add
| custom tokenizer exceptions to the language data. See the usage guide
| on #[+a("/docs/usage/adding-languages#tokenizer-exceptions") adding languages]
| on #[+a("/usage/adding-languages#tokenizer-exceptions") adding languages]
| for more details and examples.
+aside-code("Example").

View File

@ -0,0 +1,24 @@
//- 💫 DOCS > API > TOP-LEVEL
include ../_includes/_mixins
+section("spacy")
//-+h(2, "spacy") spaCy
//- spacy/__init__.py
include _top-level/_spacy
+section("displacy")
+h(2, "displacy", "spacy/displacy") displaCy
include _top-level/_displacy
+section("util")
+h(2, "util", "spacy/util.py") Utility functions
include _top-level/_util
+section("compat")
+h(2, "compat", "spacy/compaty.py") Compatibility functions
include _top-level/_compat
+section("cli", "spacy/cli")
+h(2, "cli") Command line
include _top-level/_cli

333
website/api/vectors.jade Normal file
View File

@ -0,0 +1,333 @@
//- 💫 DOCS > API > VECTORS
include ../_includes/_mixins
p
| Vectors data is kept in the #[code Vectors.data] attribute, which should
| be an instance of #[code numpy.ndarray] (for CPU vectors) or
| #[code cupy.ndarray] (for GPU vectors).
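p
| For example, when initialised from an existing table, the raw array
| stays accessible via #[code Vectors.data] (a small sketch):
+aside-code("Example").
import numpy
from spacy.vectors import Vectors
from spacy.strings import StringStore
vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors(StringStore(), vector_table)
assert vectors.data.shape == (3, 300)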
+h(2, "init") Vectors.__init__
+tag method
p
| Create a new vector store. To keep the vector table empty, pass
| #[code data_or_width=0]. You can also create the vector table and add
| vectors one by one, or set the vector values directly on initialisation.
+aside-code("Example").
from spacy.vectors import Vectors
from spacy.strings import StringStore
empty_vectors = Vectors(StringStore())
vectors = Vectors([u'cat'], 300)
vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors(StringStore(), vector_table)
+table(["Name", "Type", "Description"])
+row
+cell #[code strings]
+cell #[code StringStore] or list
+cell
| List of strings, or a #[+api("stringstore") #[code StringStore]]
| that maps strings to hash values, and vice versa.
+row
+cell #[code data_or_width]
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']] or int
+cell Vector data or number of dimensions.
+row("foot")
+cell returns
+cell #[code Vectors]
+cell The newly created object.
+h(2, "getitem") Vectors.__getitem__
+tag method
p
| Get a vector by key. If key is a string, it is hashed to an integer ID
| using the #[code Vectors.strings] table. If the integer key is not found
| in the table, a #[code KeyError] is raised.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
cat_vector = vectors[u'cat']
+table(["Name", "Type", "Description"])
+row
+cell #[code key]
+cell unicode / int
+cell The key to get the vector for.
+row("foot")
+cell returns
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell The vector for the key.
+h(2, "setitem") Vectors.__setitem__
+tag method
p
| Set a vector for the given key. If key is a string, it is hashed to an
| integer ID using the #[code Vectors.strings] table.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
+table(["Name", "Type", "Description"])
+row
+cell #[code key]
+cell unicode / int
+cell The key to set the vector for.
+row
+cell #[code vector]
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell The vector to set.
+h(2, "iter") Vectors.__iter__
+tag method
p Yield vectors from the table.
+aside-code("Example").
vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors(StringStore(), vector_table)
for vector in vectors:
print(vector)
+table(["Name", "Type", "Description"])
+row("foot")
+cell yields
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell A vector from the table.
+h(2, "len") Vectors.__len__
+tag method
p Return the number of vectors that have been assigned.
+aside-code("Example").
vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors(StringStore(), vector_table)
assert len(vectors) == 3
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell int
+cell The number of vectors in the data.
+h(2, "contains") Vectors.__contains__
+tag method
p
| Check whether a key has a vector entry in the table. If key is a string,
| it is hashed to an integer ID using the #[code Vectors.strings] table.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
assert u'cat' in vectors
+table(["Name", "Type", "Description"])
+row
+cell #[code key]
+cell unicode / int
+cell The key to check.
+row("foot")
+cell returns
+cell bool
+cell Whether the key has a vector entry.
+h(2, "add") Vectors.add
+tag method
p
| Add a key to the table, optionally setting a vector value as well. If
| key is a string, it is hashed to an integer ID using the
| #[code Vectors.strings] table.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
+table(["Name", "Type", "Description"])
+row
+cell #[code key]
+cell unicode / int
+cell The key to add.
+row
+cell #[code vector]
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell An optional vector to add.
+h(2, "items") Vectors.items
+tag method
p Iterate over #[code (string key, vector)] pairs, in order.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
for key, vector in vectors.items():
print(key, vector)
+table(["Name", "Type", "Description"])
+row("foot")
+cell yields
+cell tuple
+cell #[code (string key, vector)] pairs, in order.
+h(2, "shape") Vectors.shape
+tag property
p
| Get the number of rows and dimensions in the vector table, as a
| #[code (rows, dims)] tuple.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
rows, dims = vectors.shape
assert rows == 1
assert dims == 300
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell tuple
+cell A #[code (rows, dims)] tuple.
+h(2, "from_glove") Vectors.from_glove
+tag method
p
| Load #[+a("https://nlp.stanford.edu/projects/glove/") GloVe] vectors from
| a directory. Assumes binary format, that the vocab is in a
| #[code vocab.txt], and that vectors are named
| #[code vectors.{size}.[fd].bin], e.g. #[code vectors.128.f.bin] for 128d
| float32 vectors, #[code vectors.300.d.bin] for 300d float64 (double)
| vectors, etc. By default GloVe outputs 64-bit vectors.
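p
| A minimal sketch; the directory path is only an illustration.
+aside-code("Example").
vectors = Vectors(StringStore())
vectors.from_glove('/path/to/glove_vectors')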
+table(["Name", "Type", "Description"])
+row
+cell #[code path]
+cell unicode / #[code Path]
+cell The path to load the GloVe vectors from.
+h(2, "to_disk") Vectors.to_disk
+tag method
p Save the current state to a directory.
+aside-code("Example").
vectors.to_disk('/path/to/vectors')
+table(["Name", "Type", "Description"])
+row
+cell #[code path]
+cell unicode or #[code Path]
+cell
| A path to a directory, which will be created if it doesn't exist.
| Paths may be either strings or #[code Path]-like objects.
+h(2, "from_disk") Vectors.from_disk
+tag method
p Loads state from a directory. Modifies the object in place and returns it.
+aside-code("Example").
vectors = Vectors(StringStore())
vectors.from_disk('/path/to/vectors')
+table(["Name", "Type", "Description"])
+row
+cell #[code path]
+cell unicode or #[code Path]
+cell
| A path to a directory. Paths may be either strings or
| #[code Path]-like objects.
+row("foot")
+cell returns
+cell #[code Vectors]
+cell The modified #[code Vectors] object.
+h(2, "to_bytes") Vectors.to_bytes
+tag method
p Serialize the current state to a binary string.
+aside-code("Example").
vectors_bytes = vectors.to_bytes()
+table(["Name", "Type", "Description"])
+row
+cell #[code **exclude]
+cell -
+cell Named attributes to prevent from being serialized.
+row("foot")
+cell returns
+cell bytes
+cell The serialized form of the #[code Vectors] object.
+h(2, "from_bytes") Vectors.from_bytes
+tag method
p Load state from a binary string.
+aside-code("Example").
from spacy.vectors import Vectors
vectors_bytes = vectors.to_bytes()
new_vectors = Vectors(StringStore())
new_vectors.from_bytes(vectors_bytes)
+table(["Name", "Type", "Description"])
+row
+cell #[code bytes_data]
+cell bytes
+cell The data to load from.
+row
+cell #[code **exclude]
+cell -
+cell Named attributes to prevent from being loaded.
+row("foot")
+cell returns
+cell #[code Vectors]
+cell The #[code Vectors] object.
+h(2, "attributes") Attributes
+table(["Name", "Type", "Description"])
+row
+cell #[code data]
+cell #[code numpy.ndarray] / #[code cupy.ndarray]
+cell
| Stored vectors data. #[code numpy] is used for CPU vectors,
| #[code cupy] for GPU vectors.
+row
+cell #[code key2row]
+cell dict
+cell
| Dictionary mapping word hashes to rows in the
| #[code Vectors.data] table.
+row
+cell #[code keys]
+cell #[code numpy.ndarray]
+cell
| Array keeping the keys in order, such that
| #[code keys[vectors.key2row[key]] == key]

View File

@ -1,17 +1,22 @@
//- 💫 DOCS > API > VOCAB
include ../../_includes/_mixins
include ../_includes/_mixins
p
| A lookup table that allows you to access #[code Lexeme] objects. The
| #[code Vocab] instance also provides access to the #[code StringStore],
| and owns underlying C-data that is shared between #[code Doc] objects.
| The #[code Vocab] object provides a lookup table that allows you to
| access #[+api("lexeme") #[code Lexeme]] objects, as well as the
| #[+api("stringstore") #[code StringStore]]. It also owns underlying
| C-data that is shared between #[code Doc] objects.
+h(2, "init") Vocab.__init__
+tag method
p Create the vocabulary.
+aside-code("Example").
from spacy.vocab import Vocab
vocab = Vocab(strings=[u'hello', u'world'])
+table(["Name", "Type", "Description"])
+row
+cell #[code lex_attr_getters]
@ -39,7 +44,7 @@ p Create the vocabulary.
| A #[+api("stringstore") #[code StringStore]] that maps
| strings to hash values, and vice versa, or a list of strings.
+footrow
+row("foot")
+cell returns
+cell #[code Vocab]
+cell The newly constructed object.
@ -54,7 +59,7 @@ p Get the current number of lexemes in the vocabulary.
assert len(nlp.vocab) > 0
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell returns
+cell int
+cell The number of lexemes in the vocabulary.
@ -76,7 +81,7 @@ p
+cell int / unicode
+cell The hash value of a word, or its unicode string.
+footrow
+row("foot")
+cell returns
+cell #[code Lexeme]
+cell The lexeme indicated by the given ID.
@ -90,7 +95,7 @@ p Iterate over the lexemes in the vocabulary.
stop_words = (lex for lex in nlp.vocab if lex.is_stop)
+table(["Name", "Type", "Description"])
+footrow
+row("foot")
+cell yields
+cell #[code Lexeme]
+cell An entry in the vocabulary.
@ -115,7 +120,7 @@ p
+cell unicode
+cell The ID string.
+footrow
+row("foot")
+cell returns
+cell bool
+cell Whether the string has an entry in the vocabulary.
@ -152,11 +157,100 @@ p
| which the flag will be stored. If #[code -1], the lowest
| available bit will be chosen.
+footrow
+row("foot")
+cell returns
+cell int
+cell The integer ID by which the flag value can be checked.
+h(2, "add_flag") Vocab.clear_vectors
+tag method
+tag-new(2)
p
| Drop the current vector table. Because all vectors must be the same
| width, you have to call this to change the size of the vectors.
+aside-code("Example").
nlp.vocab.clear_vectors(new_dim=300)
+table(["Name", "Type", "Description"])
+row
+cell #[code new_dim]
+cell int
+cell
| Number of dimensions of the new vectors. If #[code None], size
| is not changed.
+h(2, "add_flag") Vocab.get_vector
+tag method
+tag-new(2)
p
| Retrieve a vector for a word in the vocabulary. Words can be looked up
| by string or hash value. If no vectors data is loaded, a
| #[code ValueError] is raised.
+aside-code("Example").
nlp.vocab.get_vector(u'apple')
+table(["Name", "Type", "Description"])
+row
+cell #[code orth]
+cell int / unicode
+cell The hash value of a word, or its unicode string.
+row("foot")
+cell returns
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell
| A word vector. Size and shape are determined by the
| #[code Vocab.vectors] instance.
+h(2, "add_flag") Vocab.set_vector
+tag method
+tag-new(2)
p
| Set a vector for a word in the vocabulary. Words can be referenced
| by string or hash value.
+aside-code("Example").
nlp.vocab.set_vector(u'apple', array([...]))
+table(["Name", "Type", "Description"])
+row
+cell #[code orth]
+cell int / unicode
+cell The hash value of a word, or its unicode string.
+row
+cell #[code vector]
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell The vector to set.
+h(2, "add_flag") Vocab.has_vector
+tag method
+tag-new(2)
p
| Check whether a word has a vector. Returns #[code False] if no vectors
| are loaded. Words can be looked up by string or hash value.
+aside-code("Example").
if nlp.vocab.has_vector(u'apple'):
vector = nlp.vocab.get_vector(u'apple')
+table(["Name", "Type", "Description"])
+row
+cell #[code orth]
+cell int / unicode
+cell The hash value of a word, or its unicode string.
+row("foot")
+cell returns
+cell bool
+cell Whether the word has a vector.
+h(2, "to_disk") Vocab.to_disk
+tag method
+tag-new(2)
@ -192,7 +286,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
| A path to a directory. Paths may be either strings or
| #[code Path]-like objects.
+footrow
+row("foot")
+cell returns
+cell #[code Vocab]
+cell The modified #[code Vocab] object.
@ -211,7 +305,7 @@ p Serialize the current state to a binary string.
+cell -
+cell Named attributes to prevent from being serialized.
+footrow
+row("foot")
+cell returns
+cell bytes
+cell The serialized form of the #[code Vocab] object.
@ -238,7 +332,7 @@ p Load state from a binary string.
+cell -
+cell Named attributes to prevent from being loaded.
+footrow
+row("foot")
+cell returns
+cell #[code Vocab]
+cell The #[code Vocab] object.
@ -256,3 +350,14 @@ p Load state from a binary string.
+cell #[code strings]
+cell #[code StringStore]
+cell A table managing the string-to-int mapping.
+row
+cell #[code vectors]
+tag-new(2)
+cell #[code Vectors]
+cell A table associating word IDs to word vectors.
+row
+cell #[code vectors_length]
+cell int
+cell Number of dimensions for each word vector.

View File

@ -1,156 +0,0 @@
//- 💫 DOCS > API > ANNOTATION SPECS
include ../../_includes/_mixins
p This document describes the target annotations spaCy is trained to predict.
+h(2, "tokenization") Tokenization
p
| Tokenization standards are based on the
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
| The tokenizer differs from most by including tokens for significant
| whitespace. Any sequence of whitespace characters beyond a single space
| (#[code ' ']) is included as a token.
+aside-code("Example").
from spacy.lang.en import English
nlp = English()
tokens = nlp('Some\nspaces and\ttab characters')
tokens_text = [t.text for t in tokens]
assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and',
'\t', 'tab', 'characters']
p
| The whitespace tokens are useful for much the same reason punctuation is –
| it's often an important delimiter in the text. By preserving it in the
| token output, we are able to maintain a simple alignment between the
| tokens and the original string, and we ensure that no information is
| lost during processing.
+h(2, "sentence-boundary") Sentence boundary detection
p
| Sentence boundaries are calculated from the syntactic parse tree, so
| features such as punctuation and capitalisation play an important but
| non-decisive role in determining the sentence boundaries. Usually this
| means that the sentence boundaries will at least coincide with clause
| boundaries, even given poorly punctuated text.
+h(2, "pos-tagging") Part-of-speech Tagging
+aside("Tip: Understanding tags")
| You can also use #[code spacy.explain()] to get the description for the
| string representation of a tag. For example,
| #[code spacy.explain("RB")] will return "adverb".
include _annotation/_pos-tags
+h(2, "lemmatization") Lemmatization
p A "lemma" is the uninflected form of a word. In English, this means:
+list
+item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
+item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
+item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
+item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
p
| The lemmatization data is taken from
| #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
| special case for pronouns: all pronouns are lemmatized to the special
| token #[code -PRON-].
+infobox("About spaCy's custom pronoun lemma")
| Unlike verbs and common nouns, there's no clear base form of a personal
| pronoun. Should the lemma of "me" be "I", or should we normalize person
| as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
| novel symbol, #[code -PRON-], which is used as the lemma for
| all personal pronouns.
+h(2, "dependency-parsing") Syntactic Dependency Parsing
+aside("Tip: Understanding labels")
| You can also use #[code spacy.explain()] to get the description for the
| string representation of a label. For example,
| #[code spacy.explain("prt")] will return "particle".
include _annotation/_dep-labels
+h(2, "named-entities") Named Entity Recognition
+aside("Tip: Understanding entity types")
| You can also use #[code spacy.explain()] to get the description for the
| string representation of an entity label. For example,
| #[code spacy.explain("LANGUAGE")] will return "any named language".
include _annotation/_named-entities
+h(3, "biluo") BILUO Scheme
p
| spaCy translates character offsets into the BILUO scheme, in order to
| decide the cost of each action given the current state of the entity
| recognizer. The costs are then used to calculate the gradient of the
| loss, to train the model.
+aside("Why BILUO, not IOB?")
| There are several coding schemes for encoding entity annotations as
| token tags. These coding schemes are equally expressive, but not
| necessarily equally learnable.
| #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
| showed that the minimal #[strong Begin], #[strong In], #[strong Out]
| scheme was more difficult to learn than the #[strong BILUO] scheme that
| we use, which explicitly marks boundary tokens.
+table([ "Tag", "Description" ])
+row
+cell #[code #[span.u-color-theme B] EGIN]
+cell The first token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme I] N]
+cell An inner token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme L] AST]
+cell The final token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme U] NIT]
+cell A single-token entity.
+row
+cell #[code #[span.u-color-theme O] UT]
+cell A non-entity token.
+h(2, "json-input") JSON input format for training
p
| spaCy takes training data in the following format:
+code("Example structure").
doc: {
id: string,
paragraphs: [{
raw: string,
sents: [int],
tokens: [{
start: int,
tag: string,
head: int,
dep: string
}],
ner: [{
start: int,
end: int,
label: string
}],
brackets: [{
start: int,
end: int,
label: string
}]
}]
}

View File

@ -1,111 +0,0 @@
//- 💫 DOCS > API > DEPENDENCYPARSER
include ../../_includes/_mixins
p Annotate syntactic dependencies on #[code Doc] objects.
+under-construction
+h(2, "init") DependencyParser.__init__
+tag method
p Create a #[code DependencyParser].
+table(["Name", "Type", "Description"])
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell The vocabulary. Must be shared with documents to be processed.
+row
+cell #[code model]
+cell #[thinc.linear.AveragedPerceptron]
+cell The statistical model.
+footrow
+cell returns
+cell #[code DependencyParser]
+cell The newly constructed object.
+h(2, "call") DependencyParser.__call__
+tag method
p
| Apply the dependency parser, setting the heads and dependency relations
| onto the #[code Doc] object.
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The document to be processed.
+footrow
+cell returns
+cell #[code None]
+cell -
+h(2, "pipe") DependencyParser.pipe
+tag method
p Process a stream of documents.
+table(["Name", "Type", "Description"])
+row
+cell #[code stream]
+cell -
+cell The sequence of documents to process.
+row
+cell #[code batch_size]
+cell int
+cell The number of documents to accumulate into a working set.
+row
+cell #[code n_threads]
+cell int
+cell
| The number of threads with which to work on the buffer in
| parallel.
+footrow
+cell yields
+cell #[code Doc]
+cell Documents, in order.
+h(2, "update") DependencyParser.update
+tag method
p Update the statistical model.
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The example document for the update.
+row
+cell #[code gold]
+cell #[code GoldParse]
+cell The gold-standard annotations, to calculate the loss.
+footrow
+cell returns
+cell int
+cell The loss on this example.
+h(2, "step_through") DependencyParser.step_through
+tag method
p Set up a stepwise state, to introspect and control the transition sequence.
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The document to step through.
+footrow
+cell returns
+cell #[code StepwiseState]
+cell A state object, to step through the annotation process.

View File

@ -1,109 +0,0 @@
//- 💫 DOCS > API > ENTITYRECOGNIZER
include ../../_includes/_mixins
p Annotate named entities on #[code Doc] objects.
+under-construction
+h(2, "init") EntityRecognizer.__init__
+tag method
p Create an #[code EntityRecognizer].
+table(["Name", "Type", "Description"])
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell The vocabulary. Must be shared with documents to be processed.
+row
+cell #[code model]
+cell #[thinc.linear.AveragedPerceptron]
+cell The statistical model.
+footrow
+cell returns
+cell #[code EntityRecognizer]
+cell The newly constructed object.
+h(2, "call") EntityRecognizer.__call__
+tag method
p Apply the entity recognizer, setting the NER tags onto the #[code Doc] object.
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The document to be processed.
+footrow
+cell returns
+cell #[code None]
+cell -
+h(2, "pipe") EntityRecognizer.pipe
+tag method
p Process a stream of documents.
+table(["Name", "Type", "Description"])
+row
+cell #[code stream]
+cell -
+cell The sequence of documents to process.
+row
+cell #[code batch_size]
+cell int
+cell The number of documents to accumulate into a working set.
+row
+cell #[code n_threads]
+cell int
+cell
| The number of threads with which to work on the buffer in
| parallel.
+footrow
+cell yields
+cell #[code Doc]
+cell Documents, in order.
+h(2, "update") EntityRecognizer.update
+tag method
p Update the statistical model.
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The example document for the update.
+row
+cell #[code gold]
+cell #[code GoldParse]
+cell The gold-standard annotations, to calculate the loss.
+footrow
+cell returns
+cell int
+cell The loss on this example.
+h(2, "step_through") EntityRecognizer.step_through
+tag method
p Set up a stepwise state, to introspect and control the transition sequence.
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The document to step through.
+footrow
+cell returns
+cell #[code StepwiseState]
+cell A state object, to step through the annotation process.

View File

@ -1,241 +0,0 @@
//- 💫 DOCS > API > FACTS & FIGURES
include ../../_includes/_mixins
+under-construction
+h(2, "comparison") Feature comparison
p
| Here's a quick comparison of the functionalities offered by spaCy,
| #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") SyntaxNet],
| #[+a("http://www.nltk.org/py-modindex.html") NLTK] and
| #[+a("http://stanfordnlp.github.io/CoreNLP/") CoreNLP].
+table([ "", "spaCy", "SyntaxNet", "NLTK", "CoreNLP"])
+row
+cell Easy installation
each icon in [ "pro", "con", "pro", "pro" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Python API
each icon in [ "pro", "con", "pro", "con" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Multi-language support
each icon in [ "neutral", "pro", "pro", "pro" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Tokenization
each icon in [ "pro", "pro", "pro", "pro" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Part-of-speech tagging
each icon in [ "pro", "pro", "pro", "pro" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Sentence segmentation
each icon in [ "pro", "pro", "pro", "pro" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Dependency parsing
each icon in [ "pro", "pro", "con", "pro" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Entity Recognition
each icon in [ "pro", "con", "pro", "pro" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Integrated word vectors
each icon in [ "pro", "con", "con", "con" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Sentiment analysis
each icon in [ "pro", "con", "pro", "pro" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Coreference resolution
each icon in [ "con", "con", "con", "pro" ]
+cell.u-text-center #[+procon(icon)]
+h(2, "benchmarks") Benchmarks
p
| Two peer-reviewed papers in 2015 confirm that spaCy offers the
| #[strong fastest syntactic parser in the world] and that
| #[strong its accuracy is within 1% of the best] available. The few
| systems that are more accurate are 20× slower or more.
+aside("About the evaluation")
| The first of the evaluations was published by #[strong Yahoo! Labs] and
| #[strong Emory University], as part of a survey of current parsing
| technologies #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") (Choi et al., 2015)].
| Their results and subsequent discussions helped us develop a novel
| psychologically-motivated technique to improve spaCy's accuracy, which
| we published in joint work with Macquarie University
| #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].
+table([ "System", "Language", "Accuracy", "Speed (wps)"])
+row
each data in [ "spaCy", "Cython", "91.8", "13,963" ]
+cell #[strong=data]
+row
each data in [ "ClearNLP", "Java", "91.7", "10,271" ]
+cell=data
+row
each data in [ "CoreNLP", "Java", "89.6", "8,602"]
+cell=data
+row
each data in [ "MATE", "Java", "92.5", "550"]
+cell=data
+row
each data in [ "Turbo", "C++", "92.4", "349" ]
+cell=data
+h(3, "parse-accuracy") Parse accuracy
p
| In 2016, Google released their
| #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") SyntaxNet]
| library, setting a new state of the art for syntactic dependency parsing
| accuracy. SyntaxNet's algorithm is very similar to spaCy's. The main
| difference is that SyntaxNet uses a neural network while spaCy uses a
| sparse linear model.
+aside("Methodology")
| #[+a("http://arxiv.org/abs/1603.06042") Andor et al. (2016)] chose
| slightly different experimental conditions from
| #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") Choi et al. (2015)],
| so the two accuracy tables here do not present directly comparable
| figures. We have only evaluated spaCy in the "News" condition following
| the SyntaxNet methodology. We don't yet have benchmark figures for the
| "Web" and "Questions" conditions.
+table([ "System", "News", "Web", "Questions" ])
+row
+cell spaCy
each data in [ 92.8, "n/a", "n/a" ]
+cell=data
+row
+cell #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") Parsey McParseface]
each data in [ 94.15, 89.08, 94.77 ]
+cell=data
+row
+cell #[+a("http://www.cs.cmu.edu/~ark/TurboParser/") Martins et al. (2013)]
each data in [ 93.10, 88.23, 94.21 ]
+cell=data
+row
+cell #[+a("http://research.google.com/pubs/archive/38148.pdf") Zhang and McDonald (2014)]
each data in [ 93.32, 88.65, 93.37 ]
+cell=data
+row
+cell #[+a("http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43800.pdf") Weiss et al. (2015)]
each data in [ 93.91, 89.29, 94.17 ]
+cell=data
+row
+cell #[strong #[+a("http://arxiv.org/abs/1603.06042") Andor et al. (2016)]]
each data in [ 94.44, 90.17, 95.40 ]
+cell #[strong=data]
+h(3, "speed-comparison") Detailed speed comparison
p
| Here we compare the per-document processing time of various spaCy
| functionalities against other NLP libraries. We show both absolute
| timings (in ms) and relative performance (normalized to spaCy). Lower is
| better.
+aside("Methodology")
| #[strong Setup:] 100,000 plain-text documents were streamed from an
| SQLite3 database and processed with each NLP library to one of three
| levels of detail: tokenization, tagging, or parsing. The tasks are
| additive: to parse the text you have to tokenize and tag it.
| Pre-processing was not subtracted from the times; I report the mean
| time per document, in milliseconds, for the full pipeline to
| complete.#[br]#[br]
| #[strong Hardware]: Intel i7-3770 (2012)#[br]
| #[strong Implementation]: #[+src(gh("spacy-benchmarks")) spacy-benchmarks]
+table
+row.u-text-label.u-text-center
th.c-table__head-cell
th.c-table__head-cell(colspan="3") Absolute (ms per doc)
th.c-table__head-cell(colspan="3") Relative (to spaCy)
+row
each column in ["System", "Tokenize", "Tag", "Parse", "Tokenize", "Tag", "Parse"]
th.c-table__head-cell.u-text-label=column
+row
+cell #[strong spaCy]
each data in [ "0.2ms", "1ms", "19ms"]
+cell #[strong=data]
each data in [ "1x", "1x", "1x" ]
+cell=data
+row
each data in [ "CoreNLP", "2ms", "10ms", "49ms", "10x", "10x", "2.6x"]
+cell=data
+row
each data in [ "ZPar", "1ms", "8ms", "850ms", "5x", "8x", "44.7x" ]
+cell=data
+row
each data in [ "NLTK", "4ms", "443ms", "n/a", "20x", "443x", "n/a" ]
+cell=data
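p
| The relative numbers above can be sanity-checked with a few lines of
| Python. The sketch below is not the benchmark code itself: it times
| #[code nlp.pipe] over an in-memory list of texts rather than streaming
| documents from SQLite3, so absolute figures will differ.
+code("Rough timing sketch").
import spacy
from timeit import default_timer as timer

nlp = spacy.load('en')
texts = [u'This is a sample document.'] * 1000  # stand-in for the streamed corpus
start = timer()
for doc in nlp.pipe(texts, batch_size=1000):
    pass
elapsed_ms = (timer() - start) * 1000
print('%.2f ms per doc (full pipeline)' % (elapsed_ms / len(texts)))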
+h(3, "ner") Named entity comparison
p
| #[+a("https://aclweb.org/anthology/W/W16/W16-2703.pdf") Jiang et al. (2016)]
| present several detailed comparisons of the named entity recognition
| models provided by spaCy, CoreNLP, NLTK and LingPipe. Here we show their
| evaluation of person, location and organization accuracy on Wikipedia.
+aside("Methodology")
| Making a meaningful comparison of different named entity recognition
| systems is tricky. Systems are often trained on different data, which
| usually have slight differences in annotation style. For instance, some
| corpora include titles as part of person names, while others don't.
| These trivial differences in convention can distort comparisons
| significantly. Jiang et al.'s #[em partial overlap] metric goes a long
| way to solving this problem.
+table([ "System", "Precision", "Recall", "F-measure" ])
+row
+cell spaCy
each data in [ 0.7240, 0.6514, 0.6858 ]
+cell=data
+row
+cell #[strong CoreNLP]
each data in [ 0.7914, 0.7327, 0.7609 ]
+cell #[strong=data]
+row
+cell NLTK
each data in [ 0.5136, 0.6532, 0.5750 ]
+cell=data
+row
+cell LingPipe
each data in [ 0.5412, 0.5357, 0.5384 ]
+cell=data

View File

@ -1,93 +0,0 @@
//- 💫 DOCS > API > LANGUAGE MODELS
include ../../_includes/_mixins
p
| spaCy currently provides models for the following languages and
| capabilities:
+aside-code("Download language models", "bash").
spacy download en
spacy download de
spacy download fr
+table([ "Language", "Token", "SBD", "Lemma", "POS", "NER", "Dep", "Vector", "Sentiment"])
+row
+cell English #[code en]
each icon in [ "pro", "pro", "pro", "pro", "pro", "pro", "pro", "con" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell German #[code de]
each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell French #[code fr]
each icon in [ "pro", "con", "con", "pro", "con", "pro", "pro", "con" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Spanish #[code es]
each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
+cell.u-text-center #[+procon(icon)]
p
+button("/docs/usage/models", true, "primary") See available models
+h(2, "alpha-support") Alpha tokenization support
p
| Work has started on the following languages. You can help by
| #[+a("/docs/usage/adding-languages#language-data") improving the existing language data]
| and extending the tokenization patterns.
+aside("Usage note")
| Note that the alpha languages don't yet come with a language model. In
| order to use them, you have to import them directly:
+code.o-no-block.
from spacy.lang.fi import Finnish
nlp = Finnish()
doc = nlp(u'Ilmatyynyalukseni on täynnä ankeriaita')
+infobox("Dependencies")
| Some language tokenizers require external dependencies. To use #[strong Chinese],
| you need to have #[+a("https://github.com/fxsjy/jieba") Jieba] installed.
| The #[strong Japanese] tokenizer requires
| #[+a("https://github.com/mocobeta/janome") Janome].
+table([ "Language", "Code", "Source" ])
each language, code in { it: "Italian", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", nb: "Norwegian Bokmål", da: "Danish", hu: "Hungarian", pl: "Polish", bn: "Bengali", he: "Hebrew", zh: "Chinese", ja: "Japanese" }
+row
+cell #{language}
+cell #[code=code]
+cell
+src(gh("spaCy", "spacy/lang/" + code)) lang/#{code}
+h(2, "multi-language") Multi-language support
+tag-new(2)
p
| As of v2.0, spaCy supports models trained on more than one language. This
| is especially useful for named entity recognition. The language ID used
| for multi-language or language-neutral models is #[code xx]. The
| language class, a generic subclass containing only the base language data,
| can be found in #[+src(gh("spaCy", "spacy/lang/xx")) lang/xx].
p
| To load your model with the neutral, multi-language class, simply set
| #[code "language": "xx"] in your
| #[+a("/docs/usage/saving-loading#models-generating") model package]'s
| meta.json. You can also import the class directly, or call
| #[+api("util#get_lang_class") #[code util.get_lang_class()]] for
| lazy-loading.
+code("Standard import").
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()
+code("With lazy-loading").
from spacy.util import get_lang_class
nlp = get_lang_class('xx')

View File

@ -1,93 +0,0 @@
//- 💫 DOCS > API > TAGGER
include ../../_includes/_mixins
p Annotate part-of-speech tags on #[code Doc] objects.
+under-construction
+h(2, "init") Tagger.__init__
+tag method
p Create a #[code Tagger].
+table(["Name", "Type", "Description"])
+row
+cell #[code vocab]
+cell #[code Vocab]
+cell The vocabulary. Must be shared with documents to be processed.
+row
+cell #[code model]
+cell #[code thinc.linear.AveragedPerceptron]
+cell The statistical model.
+footrow
+cell returns
+cell #[code Tagger]
+cell The newly constructed object.
+h(2, "call") Tagger.__call__
+tag method
p Apply the tagger, setting the POS tags onto the #[code Doc] object.
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The tokens to be tagged.
+footrow
+cell returns
+cell #[code None]
+cell -
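p
| A minimal usage sketch, not part of the original reference. Rather than
| constructing a #[code Tagger] directly (which requires a statistical
| model object, as documented above), it retrieves the component from a
| loaded pipeline; the #[code get_pipe] lookup is an assumption about the
| pipeline's component names.
+code("Example").
import spacy

nlp = spacy.load('en')
tagger = nlp.get_pipe('tagger')              # assumed: the pipeline's Tagger component
doc = nlp.make_doc(u'Apples are delicious')  # tokenize only
tagger(doc)                                  # sets the POS tags on the Doc in place
print([(t.text, t.tag_) for t in doc])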
+h(2, "pipe") Tagger.pipe
+tag method
p Tag a stream of documents.
+table(["Name", "Type", "Description"])
+row
+cell #[code stream]
+cell -
+cell The sequence of documents to tag.
+row
+cell #[code batch_size]
+cell int
+cell The number of documents to accumulate into a working set.
+row
+cell #[code n_threads]
+cell int
+cell
| The number of threads with which to work on the buffer in
| parallel.
+footrow
+cell yields
+cell #[code Doc]
+cell Documents, in order.
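p
| A sketch of streaming usage, following the signature documented above;
| the batch size and thread count are arbitrary and the #[code get_pipe]
| lookup is an assumption:
+code("Example").
import spacy

nlp = spacy.load('en')
tagger = nlp.get_pipe('tagger')              # assumed: the pipeline's Tagger component
docs = (nlp.make_doc(text) for text in [u'One sentence.', u'Another one.'])
for doc in tagger.pipe(docs, batch_size=50, n_threads=2):
    print([t.tag_ for t in doc])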
+h(2, "update") Tagger.update
+tag method
p Update the statistical model with the tags supplied for the given document.
+table(["Name", "Type", "Description"])
+row
+cell #[code doc]
+cell #[code Doc]
+cell The example document for the update.
+row
+cell #[code gold]
+cell #[code GoldParse]
+cell Manager for the gold-standard tags.
+footrow
+cell returns
+cell int
+cell Number of tags predicted correctly.
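p
| An illustrative sketch following the documented signature. The gold tags
| here are made up, and the exact return value and training requirements
| may differ between versions:
+code("Example").
import spacy
from spacy.gold import GoldParse

nlp = spacy.load('en')
tagger = nlp.get_pipe('tagger')              # assumed: the pipeline's Tagger component
doc = nlp.make_doc(u'Apples are delicious')
gold = GoldParse(doc, tags=[u'NNS', u'VBP', u'JJ'])  # gold-standard tags (illustrative)
n_correct = tagger.update(doc, gold)         # number of tags predicted correctly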

View File

@ -1,7 +0,0 @@
//- 💫 DOCS > API > TENSORIZER
include ../../_includes/_mixins
p Add a tensor with position-sensitive meaning representations to a #[code Doc].
+under-construction

View File

@ -1,21 +0,0 @@
//- 💫 DOCS > API > TEXTCATEGORIZER
include ../../_includes/_mixins
p
| Add text categorization models to spaCy pipelines. The model supports
| classification with multiple, non-mutually exclusive labels.
p
| You can change the model architecture rather easily, but by default, the
| #[code TextCategorizer] class uses a convolutional neural network to
| assign position-sensitive vectors to each word in the document. This step
| is similar to the #[+api("tensorizer") #[code Tensorizer]] component, but the
| #[code TextCategorizer] uses its own CNN model, to avoid sharing weights
| with the other pipeline components. The document tensor is then
| summarized by concatenating max and mean pooling, and a multilayer
| perceptron is used to predict an output vector of length #[code nr_class],
| before a logistic activation is applied elementwise. The value of each
| output neuron is the probability that some class is present.
+under-construction
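p
| The pooling and prediction step described above can be sketched in a few
| lines of NumPy. This is an illustration of the idea, not the actual
| implementation: the hidden-layer activation and all sizes are assumptions.
+code("Architecture sketch (illustrative)").
import numpy as np

def predict_classes(doc_tensor, W1, b1, W2, b2):
    # Summarise the document by concatenating max and mean pooling
    pooled = np.concatenate([doc_tensor.max(axis=0), doc_tensor.mean(axis=0)])
    # Multilayer perceptron (ReLU hidden layer assumed), one output per class
    hidden = np.maximum(0, np.dot(pooled, W1) + b1)
    scores = np.dot(hidden, W2) + b2
    # Elementwise logistic activation: probability that each class is present
    return 1. / (1. + np.exp(-scores))

# Illustrative shapes: 7 tokens, 64-dim token vectors, 32 hidden units, 3 classes
rng = np.random.RandomState(0)
doc_tensor = rng.randn(7, 64)
W1, b1 = rng.randn(128, 32), np.zeros(32)
W2, b2 = rng.randn(32, 3), np.zeros(3)
print(predict_classes(doc_tensor, W1, b1, W2, b2))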

View File

@ -1,7 +0,0 @@
//- 💫 DOCS > API > VECTORS
include ../../_includes/_mixins
p A container class for vector data keyed by string.
+under-construction

View File

@ -0,0 +1,72 @@
//- 💫 DOCS > USAGE > MODELS > LANGUAGE SUPPORT
p spaCy currently provides models for the following languages:
+table(["Language", "Code", "Language data", "Models"])
for models, code in MODELS
- var count = Object.keys(models).length
+row
+cell=LANGUAGES[code]
+cell #[code=code]
+cell
+src(gh("spaCy", "spacy/lang/" + code)) #[code lang/#{code}]
+cell
+a("/models/" + code) #{count} #{(count == 1) ? "model" : "models"}
+h(3, "alpha-support") Alpha tokenization support
p
| Work has started on the following languages. You can help by
| #[+a("/usage/adding-languages#language-data") improving the existing language data]
| and extending the tokenization patterns.
+aside("Usage note")
| Note that the alpha languages don't yet come with a language model. In
| order to use them, you have to import them directly, or use
| #[+api("spacy#blank") #[code spacy.blank]]:
+code.o-no-block.
from spacy.lang.fi import Finnish
nlp = Finnish() # use directly
nlp = spacy.blank('fi') # blank instance
+table(["Language", "Code", "Language data"])
for lang, code in LANGUAGES
if !Object.keys(MODELS).includes(code)
+row
+cell #{LANGUAGES[code]}
+cell #[code=code]
+cell
+src(gh("spaCy", "spacy/lang/" + code)) #[code lang/#{code}]
+infobox("Dependencies")
| Some language tokenizers require external dependencies. To use #[strong Chinese],
| you need to have #[+a("https://github.com/fxsjy/jieba") Jieba] installed.
| The #[strong Japanese] tokenizer requires
| #[+a("https://github.com/mocobeta/janome") Janome].
+h(3, "multi-language") Multi-language support
+tag-new(2)
p
| As of v2.0, spaCy supports models trained on more than one language. This
| is especially useful for named entity recognition. The language ID used
| for multi-language or language-neutral models is #[code xx]. The
| language class, a generic subclass containing only the base language data,
| can be found in #[+src(gh("spaCy", "spacy/lang/xx")) #[code lang/xx]].
p
| To load your model with the neutral, multi-language class, simply set
| #[code "language": "xx"] in your
| #[+a("/usage/training#models-generating") model package]'s
| meta.json. You can also import the class directly, or call
| #[+api("util#get_lang_class") #[code util.get_lang_class()]] for
| lazy-loading.
+code("Standard import").
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()
+code("With lazy-loading").
from spacy.util import get_lang_class
nlp = get_lang_class('xx')