Merge branch 'master' into spacy.io

commit 9d1221943b
Ines Montani, 2019-03-30 20:32:14 +01:00
21 changed files with 728 additions and 48 deletions

.github/contributors/HiromuHota.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” in one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Hiromu Hota |
| Company name (if applicable) | Hitachi America, Ltd.|
| Title or role (if applicable) | Researcher |
| Date | 2019-03-25 |
| GitHub username | HiromuHota |
| Website (optional) | |


@@ -0,0 +1,71 @@
# coding: utf-8
from __future__ import unicode_literals

"""Demonstrate how to build a simple knowledge base and run an Entity Linking algorithm.
Currently still a bit of a dummy algorithm: it simply takes the entity with the
highest prior probability for a given alias.
"""
import spacy
from spacy.kb import KnowledgeBase


def create_kb(vocab):
    kb = KnowledgeBase(vocab=vocab)

    # adding entities
    entity_0 = "Q1004791_Douglas"
    print("adding entity", entity_0)
    kb.add_entity(entity=entity_0, prob=0.5)

    entity_1 = "Q42_Douglas_Adams"
    print("adding entity", entity_1)
    kb.add_entity(entity=entity_1, prob=0.5)

    entity_2 = "Q5301561_Douglas_Haig"
    print("adding entity", entity_2)
    kb.add_entity(entity=entity_2, prob=0.5)

    # adding aliases
    print()
    alias_0 = "Douglas"
    print("adding alias", alias_0)
    kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.1, 0.6, 0.2])

    alias_1 = "Douglas Adams"
    print("adding alias", alias_1)
    kb.add_alias(alias=alias_1, entities=[entity_1], probabilities=[0.9])

    print()
    print("kb size:", len(kb), kb.get_size_entities(), kb.get_size_aliases())

    return kb


def add_el(kb, nlp):
    el_pipe = nlp.create_pipe(name='entity_linker', config={"kb": kb})
    nlp.add_pipe(el_pipe, last=True)

    for alias in ["Douglas Adams", "Douglas"]:
        candidates = nlp.linker.kb.get_candidates(alias)
        print()
        print(len(candidates), "candidate(s) for", alias, ":")
        for c in candidates:
            print(" ", c.entity_, c.prior_prob)

    text = "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, " \
           "Douglas reminds us to always bring our towel. " \
           "The main character in Doug's novel is called Arthur Dent."
    doc = nlp(text)

    print()
    for token in doc:
        print("token", token.text, token.ent_type_, token.ent_kb_id_)

    print()
    for ent in doc.ents:
        print("ent", ent.text, ent.label_, ent.kb_id_)


if __name__ == "__main__":
    nlp = spacy.load('en_core_web_sm')
    my_kb = create_kb(nlp.vocab)
    add_el(my_kb, nlp)


@@ -40,6 +40,7 @@ MOD_NAMES = [
     "spacy.lexeme",
     "spacy.vocab",
     "spacy.attrs",
+    "spacy.kb",
     "spacy.morphology",
     "spacy.pipeline.pipes",
     "spacy.syntax.stateclass",


@@ -80,6 +80,8 @@ class Warnings(object):
             "the v2.x models cannot release the global interpreter lock. "
             "Future versions may introduce a `n_process` argument for "
             "parallel inference via multiprocessing.")
+    W017 = ("Alias '{alias}' already exists in the Knowledge base.")
+    W018 = ("Entity '{entity}' already exists in the Knowledge base.")


 @add_codes
@@ -371,6 +373,16 @@ class Errors(object):
             "with spacy >= 2.1.0. To fix this, reinstall Python and use a wide "
             "unicode build instead. You can also rebuild Python and set the "
             "--enable-unicode=ucs4 flag.")
+    E131 = ("Cannot write the kb_id of an existing Span object because a Span "
+            "is a read-only view of the underlying Token objects stored in the Doc. "
+            "Instead, create a new Span object and specify the `kb_id` keyword argument, "
+            "for example:\nfrom spacy.tokens import Span\n"
+            "span = Span(doc, start={start}, end={end}, label='{label}', kb_id='{kb_id}')")
+    E132 = ("The vectors for entities and probabilities for alias '{alias}' should have equal length, "
+            "but found {entities_length} and {probabilities_length} respectively.")
+    E133 = ("The sum of prior probabilities for alias '{alias}' should not exceed 1, "
+            "but found {sum}.")
+    E134 = ("Alias '{alias}' defined for unknown entity '{entity}'.")


 @add_codes

spacy/kb.pxd (new file, 148 lines)

@@ -0,0 +1,148 @@
"""Knowledge-base for entity or concept linking."""
from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap
from libcpp.vector cimport vector
from libc.stdint cimport int32_t, int64_t
from spacy.vocab cimport Vocab
from .typedefs cimport hash_t
# Internal struct, for storage and disambiguation. This isn't what we return
# to the user as the answer to "here's your entity". It's the minimum number
# of bits we need to keep track of the answers.
cdef struct _EntryC:
# The hash of this entry's unique ID and name in the kB
hash_t entity_hash
# Allows retrieval of one or more vectors.
# Each element of vector_rows should be an index into a vectors table.
# Every entry should have the same number of vectors, so we can avoid storing
# the number of vectors in each knowledge-base struct
int32_t* vector_rows
# Allows retrieval of a struct of non-vector features. We could make this a
# pointer, but we have 32 bits left over in the struct after prob, so we'd
# like this to only be 32 bits. We can also set this to -1, for the common
# case where there are no features.
int32_t feats_row
# log probability of entity, based on corpus frequency
float prob
# Each alias struct stores a list of Entry pointers with their prior probabilities
# for this specific mention/alias.
cdef struct _AliasC:
# All entry candidates for this alias
vector[int64_t] entry_indices
# Prior probability P(entity|alias) - should sum up to (at most) 1.
vector[float] probs
# Object used by the Entity Linker that summarizes one entity-alias candidate combination.
cdef class Candidate:
cdef readonly KnowledgeBase kb
cdef hash_t entity_hash
cdef hash_t alias_hash
cdef float prior_prob
cdef class KnowledgeBase:
cdef Pool mem
cpdef readonly Vocab vocab
# This maps 64bit keys (hash of unique entity string)
# to 64bit values (position of the _EntryC struct in the _entries vector).
# The PreshMap is pretty space efficient, as it uses open addressing. So
# the only overhead is the vacancy rate, which is approximately 30%.
cdef PreshMap _entry_index
# Each entry takes 128 bits, and again we'll have a 30% or so overhead for
# over allocation.
# In total we end up with (N*128*1.3)+(N*128*1.3) bits for N entries.
# Storing 1m entries would take 41.6mb under this scheme.
cdef vector[_EntryC] _entries
# This maps 64bit keys (hash of unique alias string)
# to 64bit values (position of the _AliasC struct in the _aliases_table vector).
cdef PreshMap _alias_index
# This should map mention hashes to (entry_id, prob) tuples. The probability
# should be P(entity | mention), which is pretty important to know.
# We can pack both pieces of information into a 64-bit value, to keep things
# efficient.
cdef vector[_AliasC] _aliases_table
# This is the part which might take more space: storing various
# categorical features for the entries, and storing vectors for disambiguation
# and possibly usage.
# If each entry gets a 300-dimensional vector, for 1m entries we would need
# 1.2gb. That gets expensive fast. What might be better is to avoid learning
# a unique vector for every entity. We could instead have a compositional
# model, that embeds different features of the entities into vectors. We'll
# still want some per-entity features, like the Wikipedia text or entity
# co-occurrence. Hopefully those vectors can be narrow, e.g. 64 dimensions.
cdef object _vectors_table
# It's very useful to track categorical features, at least for output, even
# if they're not useful in the model itself. For instance, we should be
# able to track stuff like a person's date of birth or whatever. This can
# easily make the KB bigger, but if this isn't needed by the model, and it's
# optional data, we can let users configure a DB as the backend for this.
cdef object _features_table
cdef inline int64_t c_add_entity(self, hash_t entity_hash, float prob,
int32_t* vector_rows, int feats_row):
"""Add an entry to the knowledge base."""
# This is what we'll map the hash key to. It's where the entry will sit
# in the vector of entries, so we can get it later.
cdef int64_t new_index = self._entries.size()
self._entries.push_back(
_EntryC(
entity_hash=entity_hash,
vector_rows=vector_rows,
feats_row=feats_row,
prob=prob
))
self._entry_index[entity_hash] = new_index
return new_index
cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs):
"""Connect a mention to a list of potential entities with their prior probabilities ."""
cdef int64_t new_index = self._aliases_table.size()
self._aliases_table.push_back(
_AliasC(
entry_indices=entry_indices,
probs=probs
))
self._alias_index[alias_hash] = new_index
return new_index
cdef inline _create_empty_vectors(self):
"""
Making sure the first element of each vector is a dummy,
because the PreshMap maps pointing to indices in these vectors can not contain 0 as value
cf. https://github.com/explosion/preshed/issues/17
"""
cdef int32_t dummy_value = 0
self.vocab.strings.add("")
self._entries.push_back(
_EntryC(
entity_hash=self.vocab.strings[""],
vector_rows=&dummy_value,
feats_row=dummy_value,
prob=dummy_value
))
self._aliases_table.push_back(
_AliasC(
entry_indices=[dummy_value],
probs=[dummy_value]
))
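The size estimate in the comments above is easy to verify: two vacancy-padded 128-bit records per entry, one in the entries vector and one in the hash index. A quick back-of-the-envelope check in plain Python (not part of the commit; figures taken from the comments):

# Sanity check for the "41.6mb for 1m entries" estimate in the kb.pxd comments.
# Each entry costs 128 bits in the entries vector plus 128 bits in the index,
# each with ~30% over-allocation for open-addressing vacancies.
bits_per_entry = 128
overhead = 1.3
n_entries = 1000000

total_bits = (n_entries * bits_per_entry * overhead) + (n_entries * bits_per_entry * overhead)
total_mb = total_bits / 8 / 1000000.0
print(total_mb)  # 41.6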

spacy/kb.pyx (new file, 131 lines)

@@ -0,0 +1,131 @@
# cython: profile=True
# coding: utf8
from spacy.errors import Errors, Warnings, user_warning


cdef class Candidate:

    def __init__(self, KnowledgeBase kb, entity_hash, alias_hash, prior_prob):
        self.kb = kb
        self.entity_hash = entity_hash
        self.alias_hash = alias_hash
        self.prior_prob = prior_prob

    @property
    def entity(self):
        """RETURNS (uint64): hash of the entity's KB ID/name"""
        return self.entity_hash

    @property
    def entity_(self):
        """RETURNS (unicode): ID/name of this entity in the KB"""
        return self.kb.vocab.strings[self.entity]

    @property
    def alias(self):
        """RETURNS (uint64): hash of the alias"""
        return self.alias_hash

    @property
    def alias_(self):
        """RETURNS (unicode): ID of the original alias"""
        return self.kb.vocab.strings[self.alias]

    @property
    def prior_prob(self):
        return self.prior_prob


cdef class KnowledgeBase:

    def __init__(self, Vocab vocab):
        self.vocab = vocab
        self._entry_index = PreshMap()
        self._alias_index = PreshMap()
        self.mem = Pool()
        self._create_empty_vectors()

    def __len__(self):
        return self.get_size_entities()

    def get_size_entities(self):
        return self._entries.size() - 1  # not counting dummy element on index 0

    def get_size_aliases(self):
        return self._aliases_table.size() - 1  # not counting dummy element on index 0

    def add_entity(self, unicode entity, float prob=0.5, vectors=None, features=None):
        """
        Add an entity to the KB.
        Return the hash of the entity ID at the end.
        """
        cdef hash_t entity_hash = self.vocab.strings.add(entity)

        # Return if this entity was added before
        if entity_hash in self._entry_index:
            user_warning(Warnings.W018.format(entity=entity))
            return

        cdef int32_t dummy_value = 342
        self.c_add_entity(entity_hash=entity_hash, prob=prob,
                          vector_rows=&dummy_value, feats_row=dummy_value)
        # TODO self._vectors_table.get_pointer(vectors),
        # self._features_table.get(features))

        return entity_hash

    def add_alias(self, unicode alias, entities, probabilities):
        """
        For a given alias, add its potential entities and prior probabilities to the KB.
        Return the alias_hash at the end.
        """
        # Throw an error if the length of entities and probabilities are not the same
        if not len(entities) == len(probabilities):
            raise ValueError(Errors.E132.format(alias=alias,
                                                entities_length=len(entities),
                                                probabilities_length=len(probabilities)))

        # Throw an error if the probabilities sum up to more than 1
        prob_sum = sum(probabilities)
        if prob_sum > 1:
            raise ValueError(Errors.E133.format(alias=alias, sum=prob_sum))

        cdef hash_t alias_hash = self.vocab.strings.add(alias)

        # Return if this alias was added before
        if alias_hash in self._alias_index:
            user_warning(Warnings.W017.format(alias=alias))
            return

        cdef hash_t entity_hash
        cdef vector[int64_t] entry_indices
        cdef vector[float] probs

        for entity, prob in zip(entities, probabilities):
            entity_hash = self.vocab.strings[entity]
            if not entity_hash in self._entry_index:
                raise ValueError(Errors.E134.format(alias=alias, entity=entity))

            entry_index = <int64_t>self._entry_index.get(entity_hash)
            entry_indices.push_back(int(entry_index))
            probs.push_back(float(prob))

        self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs)

        return alias_hash

    def get_candidates(self, unicode alias):
        """TODO: where to put this functionality?"""
        cdef hash_t alias_hash = self.vocab.strings[alias]
        alias_index = <int64_t>self._alias_index.get(alias_hash)
        alias_entry = self._aliases_table[alias_index]

        return [Candidate(kb=self,
                          entity_hash=self._entries[entry_index].entity_hash,
                          alias_hash=alias_hash,
                          prior_prob=prob)
                for (entry_index, prob) in zip(alias_entry.entry_indices, alias_entry.probs)
                if entry_index != 0]
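Taken together, the Python-visible API is just add_entity, add_alias and get_candidates. A minimal usage sketch (the entity IDs and probabilities here are illustrative; picking the candidate with the highest prior probability mirrors what the dummy EntityLinker further down does):

from spacy.kb import KnowledgeBase
from spacy.lang.en import English

nlp = English()
kb = KnowledgeBase(nlp.vocab)

# register entities first, then tie an alias to them with prior probabilities
kb.add_entity(entity=u'Q42', prob=0.9)
kb.add_entity(entity=u'Q5301561', prob=0.1)
kb.add_alias(alias=u'Douglas', entities=[u'Q42', u'Q5301561'], probabilities=[0.6, 0.2])

# rank the candidates for a mention by their prior probability
candidates = kb.get_candidates(u'Douglas')
best = max(candidates, key=lambda c: c.prior_prob)
print(best.entity_, best.prior_prob)  # Q42 0.6 (up to float32 rounding)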


@@ -14,7 +14,7 @@ import srsly
 from .tokenizer import Tokenizer
 from .vocab import Vocab
 from .lemmatizer import Lemmatizer
-from .pipeline import DependencyParser, Tensorizer, Tagger, EntityRecognizer
+from .pipeline import DependencyParser, Tensorizer, Tagger, EntityRecognizer, EntityLinker
 from .pipeline import SimilarityHook, TextCategorizer, Sentencizer
 from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens
 from .pipeline import EntityRuler
@@ -117,6 +117,7 @@ class Language(object):
         "tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
         "parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
         "ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
+        "entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg),
         "similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
         "textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
         "sentencizer": lambda nlp, **cfg: Sentencizer(**cfg),
@@ -212,6 +213,10 @@ class Language(object):
     def entity(self):
         return self.get_pipe("ner")

+    @property
+    def linker(self):
+        return self.get_pipe("entity_linker")
+
     @property
     def matcher(self):
         return self.get_pipe("matcher")


@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from .pipes import Tagger, DependencyParser, EntityRecognizer
+from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker
 from .pipes import TextCategorizer, Tensorizer, Pipe, Sentencizer
 from .entityruler import EntityRuler
 from .hooks import SentenceSegmenter, SimilarityHook
@@ -11,6 +11,7 @@ __all__ = [
     "Tagger",
     "DependencyParser",
     "EntityRecognizer",
+    "EntityLinker",
     "TextCategorizer",
     "Tensorizer",
     "Pipe",


@@ -1061,6 +1061,55 @@ cdef class EntityRecognizer(Parser):
                                 if move[0] in ("B", "I", "L", "U")))


+class EntityLinker(Pipe):
+    name = 'entity_linker'
+
+    @classmethod
+    def Model(cls, nr_class=1, **cfg):
+        # TODO: non-dummy EL implementation
+        return None
+
+    def __init__(self, model=True, **cfg):
+        self.model = False
+        self.cfg = dict(cfg)
+        self.kb = self.cfg["kb"]
+
+    def __call__(self, doc):
+        self.set_annotations([doc], scores=None, tensors=None)
+        return doc
+
+    def pipe(self, stream, batch_size=128, n_threads=-1):
+        """Apply the pipe to a stream of documents.
+        Both __call__ and pipe should delegate to the `predict()`
+        and `set_annotations()` methods.
+        """
+        for docs in util.minibatch(stream, size=batch_size):
+            docs = list(docs)
+            self.set_annotations(docs, scores=None, tensors=None)
+            yield from docs
+
+    def set_annotations(self, docs, scores, tensors=None):
+        """
+        Currently implemented as taking the KB entry with highest prior
+        probability for each named entity.
+        TODO: actually use context etc.
+        """
+        for i, doc in enumerate(docs):
+            for ent in doc.ents:
+                candidates = self.kb.get_candidates(ent.text)
+                if candidates:
+                    best_candidate = max(candidates, key=lambda c: c.prior_prob)
+                    for token in ent:
+                        token.ent_kb_id_ = best_candidate.entity_
+
+    def get_loss(self, docs, golds, scores):
+        # TODO
+        pass
+
+    def add_label(self, label):
+        # TODO
+        pass
+
+
 class Sentencizer(object):
     """Segment the Doc into sentences using a rule-based strategy.
@@ -1146,5 +1195,5 @@ class Sentencizer(object):
         self.punct_chars = cfg.get("punct_chars", self.default_punct_chars)
         return self

-__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "Sentencizer"]
+__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]


@@ -70,4 +70,5 @@ cdef struct TokenC:
     int sent_start
     int ent_iob
     attr_t ent_type  # TODO: Is there a better way to do this? Multiple sources of truth..
+    attr_t ent_kb_id
     hash_t ent_id


@@ -172,10 +172,12 @@ def test_span_as_doc(doc):
     assert span_doc[0].idx == 0


-def test_span_string_label(doc):
-    span = Span(doc, 0, 1, label="hello")
+def test_span_string_label_kb_id(doc):
+    span = Span(doc, 0, 1, label="hello", kb_id="Q342")
     assert span.label_ == "hello"
     assert span.label == doc.vocab.strings["hello"]
+    assert span.kb_id_ == "Q342"
+    assert span.kb_id == doc.vocab.strings["Q342"]


 def test_span_label_readonly(doc):
@@ -184,6 +186,12 @@ def test_span_label_readonly(doc):
         span.label_ = "hello"


+def test_span_kb_id_readonly(doc):
+    span = Span(doc, 0, 1)
+    with pytest.raises(NotImplementedError):
+        span.kb_id_ = "Q342"
+
+
 def test_span_ents_property(doc):
     """Test span.ents for the """
     doc.ents = [


@@ -14,11 +14,11 @@ TOKENIZER_TESTS = [
 ]

 TAG_TESTS = [
-    ("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
-    ("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
-    ("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
-    ("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点']),
-    ("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
+    ("日本語だよ", ['名詞,固有名詞,地名,国', '名詞,普通名詞,一般,*', '助動詞,*,*,*', '助詞,終助詞,*,*']),
+    ("東京タワーの近くに住んでいます。", ['名詞,固有名詞,地名,一般', '名詞,普通名詞,一般,*', '助詞,格助詞,*,*', '名詞,普通名詞,副詞可能,*', '助詞,格助詞,*,*', '動詞,一般,*,*', '助詞,接続助詞,*,*', '動詞,非自立可能,*,*', '助動詞,*,*,*', '補助記号,句点,*,*']),
+    ("吾輩は猫である。", ['代名詞,*,*,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助動詞,*,*,*', '動詞,非自立可能,*,*', '補助記号,句点,*,*']),
+    ("月に代わって、お仕置きよ!", ['名詞,普通名詞,助数詞可能,*', '助詞,格助詞,*,*', '動詞,一般,*,*', '助詞,接続助詞,*,*', '補助記号,読点,*,*', '接頭辞,*,*,*', '名詞,普通名詞,一般,*', '助詞,終助詞,*,*', '補助記号,句点,*,*']),
+    ("すもももももももものうち", ['名詞,普通名詞,一般,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助詞,格助詞,*,*', '名詞,普通名詞,副詞可能,*'])
 ]

 POS_TESTS = [


@@ -0,0 +1,91 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from spacy.kb import KnowledgeBase
from spacy.lang.en import English


@pytest.fixture
def nlp():
    return English()


def test_kb_valid_entities(nlp):
    """Test the valid construction of a KB with three entities and two aliases"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2')
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases
    mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])

    # test the size of the corresponding KB
    assert(mykb.get_size_entities() == 3)
    assert(mykb.get_size_aliases() == 2)


def test_kb_invalid_entities(nlp):
    """Test the invalid construction of a KB with an alias linked to a non-existing entity"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases - should fail because one of the given IDs is not valid
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q342'], probabilities=[0.8, 0.2])


def test_kb_invalid_probabilities(nlp):
    """Test the invalid construction of a KB with wrong prior probabilities"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases - should fail because the sum of the probabilities exceeds 1
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.4])


def test_kb_invalid_combination(nlp):
    """Test the invalid construction of a KB with non-matching entity and probability lists"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases - should fail because the entities and probabilities vectors are not of equal length
    with pytest.raises(ValueError):
        mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.3, 0.4, 0.1])


def test_candidate_generation(nlp):
    """Test correct candidate generation"""
    mykb = KnowledgeBase(nlp.vocab)

    # adding entities
    mykb.add_entity(entity=u'Q1', prob=0.9)
    mykb.add_entity(entity=u'Q2', prob=0.2)
    mykb.add_entity(entity=u'Q3', prob=0.5)

    # adding aliases
    mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
    mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])

    # test the size of the relevant candidates
    assert(len(mykb.get_candidates(u'douglas')) == 2)
    assert(len(mykb.get_candidates(u'adam')) == 1)
    assert(len(mykb.get_candidates(u'shrubbery')) == 0)
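One path these tests don't exercise yet is the duplicate-entity branch that emits W018. A hypothetical extra test in the same style, assuming spaCy's user_warning surfaces as a standard UserWarning that pytest.warns can catch:

def test_kb_duplicate_entity_warns(nlp):
    """Hypothetical sketch, not part of the commit: re-adding an existing
    entity should warn (W018) and leave the KB unchanged."""
    mykb = KnowledgeBase(nlp.vocab)
    mykb.add_entity(entity=u'Q1', prob=0.9)

    # assumption: user_warning(Warnings.W018...) emits a UserWarning
    with pytest.warns(UserWarning):
        mykb.add_entity(entity=u'Q1', prob=0.4)

    # the KB should still contain a single entry for Q1
    assert(mykb.get_size_entities() == 1)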


@@ -326,7 +326,7 @@
     def doc(self):
         return self

-    def char_span(self, int start_idx, int end_idx, label=0, vector=None):
+    def char_span(self, int start_idx, int end_idx, label=0, kb_id=0, vector=None):
         """Create a `Span` object from the slice `doc.text[start : end]`.

         doc (Doc): The parent document.
@@ -334,6 +334,7 @@ cdef class Doc:
         end (int): The index of the first character after the span.
         label (uint64 or string): A label to attach to the Span, e.g. for
             named entities.
+        kb_id (uint64 or string): An ID from a KB to capture the meaning of a named entity.
         vector (ndarray[ndim=1, dtype='float32']): A meaning representation of
             the span.
         RETURNS (Span): The newly constructed object.
@@ -342,6 +343,8 @@
         """
         if not isinstance(label, int):
             label = self.vocab.strings.add(label)
+        if not isinstance(kb_id, int):
+            kb_id = self.vocab.strings.add(kb_id)
         cdef int start = token_by_start(self.c, self.length, start_idx)
         if start == -1:
             return None
@@ -350,7 +353,7 @@
             return None
         # Currently we have the token index, we want the range-end index
         end += 1
-        cdef Span span = Span(self, start, end, label=label, vector=vector)
+        cdef Span span = Span(self, start, end, label=label, kb_id=kb_id, vector=vector)
         return span

     def similarity(self, other):
@@ -484,6 +487,7 @@
             cdef const TokenC* token
             cdef int start = -1
             cdef attr_t label = 0
+            cdef attr_t kb_id = 0
             output = []
             for i in range(self.length):
                 token = &self.c[i]
@@ -493,16 +497,18 @@
                         raise ValueError(Errors.E093.format(seq=" ".join(seq)))
                 elif token.ent_iob == 2 or token.ent_iob == 0:
                     if start != -1:
-                        output.append(Span(self, start, i, label=label))
+                        output.append(Span(self, start, i, label=label, kb_id=kb_id))
                     start = -1
                     label = 0
+                    kb_id = 0
                 elif token.ent_iob == 3:
                     if start != -1:
-                        output.append(Span(self, start, i, label=label))
+                        output.append(Span(self, start, i, label=label, kb_id=kb_id))
                     start = i
                     label = token.ent_type
+                    kb_id = token.ent_kb_id
             if start != -1:
-                output.append(Span(self, start, self.length, label=label))
+                output.append(Span(self, start, self.length, label=label, kb_id=kb_id))
             return tuple(output)

         def __set__(self, ents):


@@ -11,6 +11,7 @@ cdef class Span:
     cdef readonly int start_char
     cdef readonly int end_char
     cdef readonly attr_t label
+    cdef readonly attr_t kb_id

     cdef public _vector
     cdef public _vector_norm


@@ -85,13 +85,14 @@ cdef class Span:
         return Underscore.span_extensions.pop(name)

     def __cinit__(self, Doc doc, int start, int end, label=0, vector=None,
-                  vector_norm=None):
+                  vector_norm=None, kb_id=0):
         """Create a `Span` object from the slice `doc[start : end]`.

         doc (Doc): The parent document.
         start (int): The index of the first token of the span.
         end (int): The index of the first token after the span.
         label (uint64): A label to attach to the Span, e.g. for named entities.
+        kb_id (uint64): An identifier from a Knowledge Base to capture the meaning of a named entity.
         vector (ndarray[ndim=1, dtype='float32']): A meaning representation
             of the span.
         RETURNS (Span): The newly constructed object.
@@ -110,11 +111,14 @@
             self.end_char = 0
         if isinstance(label, basestring_):
             label = doc.vocab.strings.add(label)
+        if isinstance(kb_id, basestring_):
+            kb_id = doc.vocab.strings.add(kb_id)
         if label not in doc.vocab.strings:
             raise ValueError(Errors.E084.format(label=label))
         self.label = label
         self._vector = vector
         self._vector_norm = vector_norm
+        self.kb_id = kb_id

     def __richcmp__(self, Span other, int op):
         if other is None:
@@ -655,6 +659,20 @@
                 label_ = ''
             raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_))

+    property kb_id_:
+        """RETURNS (unicode): The named entity's KB ID."""
+        def __get__(self):
+            return self.doc.vocab.strings[self.kb_id]
+
+        def __set__(self, unicode kb_id_):
+            if not kb_id_:
+                kb_id_ = ''
+            current_label = self.label_
+            if not current_label:
+                current_label = ''
+            raise NotImplementedError(Errors.E131.format(start=self.start, end=self.end,
+                                                         label=current_label, kb_id=kb_id_))
+

 cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
     # Don't allow spaces to be the root, if there are


@@ -770,6 +770,22 @@ cdef class Token:
         def __set__(self, name):
            self.c.ent_id = self.vocab.strings.add(name)

+    property ent_kb_id:
+        """RETURNS (uint64): Named entity KB ID."""
+        def __get__(self):
+            return self.c.ent_kb_id
+
+        def __set__(self, attr_t ent_kb_id):
+            self.c.ent_kb_id = ent_kb_id
+
+    property ent_kb_id_:
+        """RETURNS (unicode): Named entity KB ID."""
+        def __get__(self):
+            return self.vocab.strings[self.c.ent_kb_id]
+
+        def __set__(self, ent_kb_id):
+            self.c.ent_kb_id = self.vocab.strings.add(ent_kb_id)
+
     @property
     def whitespace_(self):
         """RETURNS (unicode): The trailing whitespace character, if present."""


@@ -75,14 +75,28 @@ export const LandingBannerGrid = ({ children }) => (
     </Grid>
 )

-export const LandingBanner = ({ title, label, to, button, small, background, color, children }) => {
+export const LandingBanner = ({
+    title,
+    label,
+    to,
+    button,
+    small,
+    background,
+    backgroundImage,
+    color,
+    children,
+}) => {
     const contentClassNames = classNames(classes.bannerContent, {
         [classes.bannerContentSmall]: small,
     })
     const textClassNames = classNames(classes.bannerText, {
         [classes.bannerTextSmall]: small,
     })
-    const style = { '--color-theme': background, '--color-back': color }
+    const style = {
+        '--color-theme': background,
+        '--color-back': color,
+        backgroundImage: backgroundImage ? `url(${backgroundImage})` : null,
+    }
     const Heading = small ? H2 : H1
     return (
         <div className={classes.banner} style={style}>
@@ -113,7 +127,7 @@ export const LandingBanner = ({ title, label, to, button, small, background, col
 export const LandingBannerButton = ({ to, small, children }) => (
     <div className={classes.bannerButton}>
-        <Button to={to} variant="tertiary" large={!small}>
+        <Button to={to} variant="tertiary" large={!small} className={classes.bannerButtonElement}>
             {children}
         </Button>
     </div>

website/src/images/spacy-irl.jpg (new binary image, 75 KiB, not shown)


@@ -73,6 +73,7 @@
     color: var(--color-back)
     padding: 5rem
     margin-bottom: var(--spacing-md)
+    background-size: cover

 .banner-content
     margin-bottom: 0
@@ -100,7 +101,7 @@
 .banner-text-small p
     font-size: 1.35rem
-    margin-bottom: 1rem
+    margin-bottom: 1.5rem

 @include breakpoint(min, md)
     .banner-content
@@ -134,6 +135,9 @@
         margin-bottom: var(--spacing-sm)
         text-align: right

+.banner-button-element
+    background: var(--color-theme)
+
 .logos
     text-align: center
     padding-bottom: 1rem


@@ -19,6 +19,7 @@ import { H2 } from '../components/typography'
 import { Ul, Li } from '../components/list'
 import Button from '../components/button'
 import Link from '../components/link'
+import irlBackground from '../images/spacy-irl.jpg'

 import BenchmarksChoi from 'usage/_benchmarks-choi.md'
@@ -151,19 +152,21 @@ const Landing = ({ data }) => {
             <LandingBannerGrid>
                 <LandingBanner
-                    title="BERT-style language model pretraining and more"
-                    label="New in v2.1"
-                    to="/usage/v2-1"
-                    button="Read more"
+                    title="spaCy IRL 2019: Two days of NLP"
+                    label="Join us in Berlin"
+                    to="https://irl.spacy.io/2019"
+                    button="Get tickets"
                     background="#ffc194"
+                    backgroundImage={irlBackground}
                     color="#1a1e23"
                     small
                 >
-                    Learn more from small training corpora by initializing your models with{' '}
-                    <strong>knowledge from raw text</strong>. The new pretrain command teaches
-                    spaCy's CNN model to predict words based on their context, producing
-                    representations of words in contexts. If you've seen Google's BERT system or
-                    fast.ai's ULMFiT, spaCy's pretraining is similar but much more efficient. It's
-                    still experimental, but users are already reporting good results, so give it a
-                    try!
+                    We're pleased to invite the spaCy community and other folks working on Natural
+                    Language Processing to Berlin this summer for a small and intimate event{' '}
+                    <strong>July 5-6, 2019</strong>. The event includes a hands-on training day for
+                    teams using spaCy in production, followed by a one-track conference. We booked a
+                    beautiful venue, hand-picked an awesome lineup of speakers and scheduled plenty
+                    of social time to get to know each other and exchange ideas.
                 </LandingBanner>

                 <LandingBanner
@@ -191,23 +194,17 @@ const Landing = ({ data }) => {
             <LandingLogos title="Featured on" logos={data.logosPublications} />

             <LandingBanner
-                title="Convolutional neural network models"
-                label="New in v2.0"
-                button="Download models"
-                to="/models"
+                title="BERT-style language model pretraining"
+                label="New in v2.1"
+                to="/usage/v2-1"
+                button="Read more"
             >
-                spaCy v2.0 features new neural models for <strong>tagging</strong>,{' '}
-                <strong>parsing</strong> and <strong>entity recognition</strong>. The models have
-                been designed and implemented from scratch specifically for spaCy, to give you an
-                unmatched balance of speed, size and accuracy. A novel bloom embedding strategy with
-                subword features is used to support huge vocabularies in tiny tables. Convolutional
-                layers with residual connections, layer normalization and maxout non-linearity are
-                used, giving much better efficiency than the standard BiLSTM solution. Finally, the
-                parser and NER use an imitation learning objective to deliver accuracy in-line with
-                the latest research systems, even when evaluated from raw text. With these
-                innovations, spaCy v2.0's models are <strong>10× smaller</strong>,{' '}
-                <strong>20% more accurate</strong>, and
-                <strong>even cheaper to run</strong> than the previous generation.
+                Learn more from small training corpora by initializing your models with{' '}
+                <strong>knowledge from raw text</strong>. The new pretrain command teaches spaCy's
+                CNN model to predict words based on their context, producing representations of
+                words in contexts. If you've seen Google's BERT system or fast.ai's ULMFiT, spaCy's
+                pretraining is similar but much more efficient. It's still experimental, but users
+                are already reporting good results, so give it a try!
             </LandingBanner>

             <LandingGrid cols={2}>