diff --git a/website/docs/images/language_data.svg b/website/docs/images/language_data.svg deleted file mode 100644 index 58482b2c5..000000000 --- a/website/docs/images/language_data.svg +++ /dev/null @@ -1,85 +0,0 @@ [SVG markup not shown – deleted language-data diagram; recoverable labels: Tokenizer, Base data, Language data, stop words, lexical attributes, tokenizer exceptions, prefixes/suffixes/infixes, lemma data, Lemmatizer, char classes, Token, morph rules, tag map, Morphology]
diff --git a/website/docs/images/tokenization.svg b/website/docs/images/tokenization.svg index 9877e1a30..d676fdace 100644 --- a/website/docs/images/tokenization.svg +++ b/website/docs/images/tokenization.svg @@ -1,123 +1,305 @@ [SVG markup not shown – replaced tokenization diagram; the old version showed “Let’s go to N.Y.!” being split step by step, with stages labeled EXCEPTION, PREFIX, SUFFIX, SUFFIX, EXCEPTION, DONE; the text of the new version is not recoverable]
diff --git a/website/docs/images/training-loop.svg b/website/docs/images/training-loop.svg deleted file mode 100644 index 144fe2d3d..000000000 --- a/website/docs/images/training-loop.svg +++ /dev/null @@ -1,40 +0,0 @@ [SVG markup not shown – deleted training-loop diagram; recoverable labels: Training data, label, text, Doc, Example, update, nlp, optimizer]
diff --git a/website/docs/images/vocab_stringstore.svg b/website/docs/images/vocab_stringstore.svg index b604041f2..e10ff3c58 100644 --- a/website/docs/images/vocab_stringstore.svg +++ b/website/docs/images/vocab_stringstore.svg @@ -1,77 +1,118 @@ [SVG markup not shown – replaced Vocab/StringStore diagram; the old version showed the hashes 31979…, 46904…, 37020… mapped to Lexemes and the strings "I", "love", "coffee", with nsubj/dobj arcs, a StringStore, a Vocab and a Doc of Tokens (I PRON, love VERB, coffee NOUN); the text of the new version is not recoverable]
diff --git a/website/docs/usage/101/_language-data.md b/website/docs/usage/101/_language-data.md index 2917b19c4..8c3cd48a3 100644 --- a/website/docs/usage/101/_language-data.md +++ b/website/docs/usage/101/_language-data.md @@ -10,8 +10,9 @@ The **shared language data** in the directory root includes rules that can be generalized across languages – for example, rules for basic punctuation, emoji, emoticons and single-letter abbreviations. The **individual language data** in a submodule contains rules that are only relevant to a particular language.
It -also takes care of putting together all components and creating the `Language` -subclass – for example, `English` or `German`. +also takes care of putting together all components and creating the +[`Language`](/api/language) subclass – for example, `English` or `German`. The +values are defined in the [`Language.Defaults`](/api/language#defaults). > ```python > from spacy.lang.en import English @@ -21,14 +22,6 @@ subclass – for example, `English` or `German`. > nlp_de = German() # Includes German data > ``` - - - - | Name | Description | | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | | **Stop words**[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. | diff --git a/website/docs/usage/spacy-101.md b/website/docs/usage/spacy-101.md index db471b1f0..27c4e3eb3 100644 --- a/website/docs/usage/spacy-101.md +++ b/website/docs/usage/spacy-101.md @@ -10,7 +10,6 @@ menu: - ['Serialization', 'serialization'] - ['Training', 'training'] - ['Language Data', 'language-data'] - - ['Lightning Tour', 'lightning-tour'] - ['Architecture', 'architecture'] - ['Community & FAQ', 'community-faq'] --- @@ -379,79 +378,6 @@ spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This will give you the object and its encoded annotations, plus the "key" to decode it. -## Knowledge base {#kb} - -To support the entity linking task, spaCy stores external knowledge in a -[`KnowledgeBase`](/api/kb). The knowledge base (KB) uses the `Vocab` to store -its data efficiently. - -> - **Mention**: A textual occurrence of a named entity, e.g. 'Miss Lovelace'. -> - **KB ID**: A unique identifier referring to a particular real-world concept, -> e.g. 'Q7259'. -> - **Alias**: A plausible synonym or description for a certain KB ID, e.g. 'Ada -> Lovelace'. -> - **Prior probability**: The probability of a certain mention resolving to a -> certain KB ID, prior to knowing anything about the context in which the -> mention is used. -> - **Entity vector**: A pretrained word vector capturing the entity -> description. - -A knowledge base is created by first adding all entities to it. Next, for each -potential mention or alias, a list of relevant KB IDs and their prior -probabilities is added. The sum of these prior probabilities should never exceed -1 for any given alias. - -```python -### {executable="true"} -import spacy -from spacy.kb import KnowledgeBase - -# load the model and create an empty KB -nlp = spacy.load('en_core_web_sm') -kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3) - -# adding entities -kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5]) -kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3]) -kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2]) - -# adding aliases -kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2]) -kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9]) - -print() -print("Number of entities in KB:",kb.get_size_entities()) # 3 -print("Number of aliases in KB:", kb.get_size_aliases()) # 2 -``` - -### Candidate generation - -Given a textual entity, the knowledge base can provide a list of plausible -candidates or entity identifiers. 
The [`EntityLinker`](/api/entitylinker) will -take this list of candidates as input, and disambiguate the mention to the most -probable identifier, given the document context. - -```python -### {executable="true"} -import spacy -from spacy.kb import KnowledgeBase - -nlp = spacy.load('en_core_web_sm') -kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3) - -# adding entities -kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5]) -kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3]) -kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2]) - -# adding aliases -kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2]) - -candidates = kb.get_candidates("Douglas") -for c in candidates: - print(" ", c.entity_, c.prior_prob, c.entity_vector) -``` - ## Serialization {#serialization} import Serialization101 from 'usage/101/\_serialization.md' @@ -485,384 +411,6 @@ import LanguageData101 from 'usage/101/\_language-data.md' -## Lightning tour {#lightning-tour} - -The following examples and code snippets give you an overview of spaCy's -functionality and its usage. - -### Install models and process text {#lightning-tour-models} - -```bash -python -m spacy download en_core_web_sm -python -m spacy download de_core_news_sm -``` - -```python -### {executable="true"} -import spacy - -nlp = spacy.load("en_core_web_sm") -doc = nlp("Hello, world. Here are two sentences.") -print([t.text for t in doc]) - -nlp_de = spacy.load("de_core_news_sm") -doc_de = nlp_de("Ich bin ein Berliner.") -print([t.text for t in doc_de]) - -``` - - - -**API:** [`spacy.load()`](/api/top-level#spacy.load) **Usage:** -[Models](/usage/models), [spaCy 101](/usage/spacy-101) - - - -### Get tokens, noun chunks & sentences {#lightning-tour-tokens-sentences model="parser"} - -```python -### {executable="true"} -import spacy - -nlp = spacy.load("en_core_web_sm") -doc = nlp("Peach emoji is where it has always been. Peach is the superior " - "emoji. It's outranking eggplant 🍑 ") -print(doc[0].text) # 'Peach' -print(doc[1].text) # 'emoji' -print(doc[-1].text) # '🍑' -print(doc[17:19].text) # 'outranking eggplant' - -noun_chunks = list(doc.noun_chunks) -print(noun_chunks[0].text) # 'Peach emoji' - -sentences = list(doc.sents) -assert len(sentences) == 3 -print(sentences[1].text) # 'Peach is the superior emoji.' -``` - - - -**API:** [`Doc`](/api/doc), [`Token`](/api/token) **Usage:** -[spaCy 101](/usage/spacy-101) - - - -### Get part-of-speech tags and flags {#lightning-tour-pos-tags model="tagger"} - -```python -### {executable="true"} -import spacy - -nlp = spacy.load("en_core_web_sm") -doc = nlp("Apple is looking at buying U.K. 
startup for $1 billion") -apple = doc[0] -print("Fine-grained POS tag", apple.pos_, apple.pos) -print("Coarse-grained POS tag", apple.tag_, apple.tag) -print("Word shape", apple.shape_, apple.shape) -print("Alphabetic characters?", apple.is_alpha) -print("Punctuation mark?", apple.is_punct) - -billion = doc[10] -print("Digit?", billion.is_digit) -print("Like a number?", billion.like_num) -print("Like an email address?", billion.like_email) -``` - - - -**API:** [`Token`](/api/token) **Usage:** -[Part-of-speech tagging](/usage/linguistic-features#pos-tagging) - - - -### Use hash values for any string {#lightning-tour-hashes} - -```python -### {executable="true"} -import spacy - -nlp = spacy.load("en_core_web_sm") -doc = nlp("I love coffee") - -coffee_hash = nlp.vocab.strings["coffee"] # 3197928453018144401 -coffee_text = nlp.vocab.strings[coffee_hash] # 'coffee' -print(coffee_hash, coffee_text) -print(doc[2].orth, coffee_hash) # 3197928453018144401 -print(doc[2].text, coffee_text) # 'coffee' - -beer_hash = doc.vocab.strings.add("beer") # 3073001599257881079 -beer_text = doc.vocab.strings[beer_hash] # 'beer' -print(beer_hash, beer_text) - -unicorn_hash = doc.vocab.strings.add("🦄") # 18234233413267120783 -unicorn_text = doc.vocab.strings[unicorn_hash] # '🦄' -print(unicorn_hash, unicorn_text) -``` - - - -**API:** [`StringStore`](/api/stringstore) **Usage:** -[Vocab, hashes and lexemes 101](/usage/spacy-101#vocab) - - - -### Recognize and update named entities {#lightning-tour-entities model="ner"} - -```python -### {executable="true"} -import spacy -from spacy.tokens import Span - -nlp = spacy.load("en_core_web_sm") -doc = nlp("San Francisco considers banning sidewalk delivery robots") -for ent in doc.ents: - print(ent.text, ent.start_char, ent.end_char, ent.label_) - -doc = nlp("FB is hiring a new VP of global policy") -doc.ents = [Span(doc, 0, 1, label="ORG")] -for ent in doc.ents: - print(ent.text, ent.start_char, ent.end_char, ent.label_) -``` - - - -**Usage:** [Named entity recognition](/usage/linguistic-features#named-entities) - - - -### Train and update neural network models {#lightning-tour-training"} - -```python -import random -import spacy -from spacy.gold import Example - -nlp = spacy.load("en_core_web_sm") -train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})] - -with nlp.select_pipes(enable="ner"): - optimizer = nlp.begin_training() - for i in range(10): - random.shuffle(train_data) - for text, annotations in train_data: - doc = nlp.make_doc(text) - example = Example.from_dict(doc, annotations) - nlp.update([example], sgd=optimizer) -nlp.to_disk("/model") -``` - - - -**API:** [`Language.update`](/api/language#update) **Usage:** -[Training spaCy's statistical models](/usage/training) - - - -### Visualize a dependency parse and named entities in your browser {#lightning-tour-displacy model="parser, ner" new="2"} - -> #### Output -> -> ![displaCy visualization](../images/displacy-small.svg) - -```python -from spacy import displacy - -doc_dep = nlp("This is a sentence.") -displacy.serve(doc_dep, style="dep") - -doc_ent = nlp("When Sebastian Thrun started working on self-driving cars at Google " - "in 2007, few people outside of the company took him seriously.") -displacy.serve(doc_ent, style="ent") -``` - - - -**API:** [`displacy`](/api/top-level#displacy) **Usage:** -[Visualizers](/usage/visualizers) - - - -### Get word vectors and similarity {#lightning-tour-word-vectors model="vectors"} - -```python -### {executable="true"} -import spacy - -nlp = 
spacy.load("en_core_web_md") -doc = nlp("Apple and banana are similar. Pasta and hippo aren't.") - -apple = doc[0] -banana = doc[2] -pasta = doc[6] -hippo = doc[8] - -print("apple <-> banana", apple.similarity(banana)) -print("pasta <-> hippo", pasta.similarity(hippo)) -print(apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector) -``` - -For the best results, you should run this example using the -[`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg) model (currently -not available in the live demo). - - - -**Usage:** [Word vectors and similarity](/usage/vectors-embeddings) - - - -### Simple and efficient serialization {#lightning-tour-serialization} - -```python -import spacy -from spacy.tokens import Doc -from spacy.vocab import Vocab - -nlp = spacy.load("en_core_web_sm") -customer_feedback = open("customer_feedback_627.txt").read() -doc = nlp(customer_feedback) -doc.to_disk("/tmp/customer_feedback_627.bin") - -new_doc = Doc(Vocab()).from_disk("/tmp/customer_feedback_627.bin") -``` - - - -**API:** [`Language`](/api/language), [`Doc`](/api/doc) **Usage:** -[Saving and loading models](/usage/saving-loading#models) - - - -### Match text with token rules {#lightning-tour-rule-matcher} - -```python -### {executable="true"} -import spacy -from spacy.matcher import Matcher - -nlp = spacy.load("en_core_web_sm") -matcher = Matcher(nlp.vocab) - -def set_sentiment(matcher, doc, i, matches): - doc.sentiment += 0.1 - -pattern1 = [[{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]] -patterns = [[{"ORTH": emoji, "OP": "+"}] for emoji in ["😀", "😂", "🤣", "😍"]] -matcher.add("GoogleIO", patterns1) # Match "Google I/O" or "Google i/o" -matcher.add("HAPPY", patterns2, on_match=set_sentiment) # Match one or more happy emoji - -doc = nlp("A text about Google I/O 😀😀") -matches = matcher(doc) - -for match_id, start, end in matches: - string_id = nlp.vocab.strings[match_id] - span = doc[start:end] - print(string_id, span.text) -print("Sentiment", doc.sentiment) -``` - - - -**API:** [`Matcher`](/api/matcher) **Usage:** -[Rule-based matching](/usage/rule-based-matching) - - - -### Minibatched stream processing {#lightning-tour-minibatched} - -```python -texts = ["One document.", "...", "Lots of documents"] -# .pipe streams input, and produces streaming output -iter_texts = (texts[i % 3] for i in range(100000000)) -for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50)): - assert doc.is_parsed - if i == 100: - break -``` - -### Get syntactic dependencies {#lightning-tour-dependencies model="parser"} - -```python -### {executable="true"} -import spacy - -nlp = spacy.load("en_core_web_sm") -doc = nlp("When Sebastian Thrun started working on self-driving cars at Google " - "in 2007, few people outside of the company took him seriously.") - -dep_labels = [] -for token in doc: - while token.head != token: - dep_labels.append(token.dep_) - token = token.head -print(dep_labels) -``` - - - -**API:** [`Token`](/api/token) **Usage:** -[Using the dependency parse](/usage/linguistic-features#dependency-parse) - - - -### Export to numpy arrays {#lightning-tour-numpy-arrays} - -```python -### {executable="true"} -import spacy -from spacy.attrs import ORTH, LIKE_URL - -nlp = spacy.load("en_core_web_sm") -doc = nlp("Check out https://spacy.io") -for token in doc: - print(token.text, token.orth, token.like_url) - -attr_ids = [ORTH, LIKE_URL] -doc_array = doc.to_array(attr_ids) -print(doc_array.shape) -print(len(doc), len(attr_ids)) - -assert doc[0].orth == doc_array[0, 0] -assert 
doc[1].orth == doc_array[1, 0] -assert doc[0].like_url == doc_array[0, 1] - -assert list(doc_array[:, 1]) == [t.like_url for t in doc] -print(list(doc_array[:, 1])) -``` - -### Calculate inline markup on original string {#lightning-tour-inline} - -```python -### {executable="true"} -import spacy - -def put_spans_around_tokens(doc): - """Here, we're building a custom "syntax highlighter" for - part-of-speech tags and dependencies. We put each token in a - span element, with the appropriate classes computed. All whitespace is - preserved, outside of the spans. (Of course, HTML will only display - multiple whitespace if enabled – but the point is, no information is lost - and you can calculate what you need, e.g. <br />, <p> etc.) - """ - output = [] - for token in doc: - if token.is_space: - output.append(token.text) - else: - classes = f"pos-{token.pos_} dep-{token.dep_}" - output.append(f'<span class="{classes}">{token.text}</span>{token.whitespace_}') - string = "".join(output) - string = string.replace("\\n", "") - string = string.replace("\\t", " ") - return f"<pre>{string}</pre>" - - -nlp = spacy.load("en_core_web_sm") -doc = nlp("This is a test.\\n\\nHello world.") -html = put_spans_around_tokens(doc) -print(html) -``` - ## Architecture {#architecture} import Architecture101 from 'usage/101/\_architecture.md'