diff --git a/.github/contributors/HiromuHota.md b/.github/contributors/HiromuHota.md new file mode 100644 index 000000000..24dfb1d7b --- /dev/null +++ b/.github/contributors/HiromuHota.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statements below. Please do NOT +mark both statements: + + * [ ] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [x] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Hiromu Hota | +| Company name (if applicable) | Hitachi America, Ltd.| +| Title or role (if applicable) | Researcher | +| Date | 2019-03-25 | +| GitHub username | HiromuHota | +| Website (optional) | | diff --git a/.github/contributors/SamuelLKane.md b/.github/contributors/SamuelLKane.md new file mode 100644 index 000000000..9c74dcd15 --- /dev/null +++ b/.github/contributors/SamuelLKane.md @@ -0,0 +1,104 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statements below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Samuel Kane | +| Date | 3/20/19 | +| GitHub username | SamuelLKane | +| Website (optional) | samuel-kane.com | \ No newline at end of file diff --git a/.github/contributors/graus.md b/.github/contributors/graus.md new file mode 100644 index 000000000..3848dfeec --- /dev/null +++ b/.github/contributors/graus.md @@ -0,0 +1,107 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. 
With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statements below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | David Graus | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 28.03.2019 | +| GitHub username | graus | +| Website (optional) | graus.nu | + diff --git a/.github/contributors/wannaphongcom.md b/.github/contributors/wannaphongcom.md new file mode 100644 index 000000000..67aae7063 --- /dev/null +++ b/.github/contributors/wannaphongcom.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). 
The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made) will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. 
+ +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statements below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Wannaphong Phatthiyaphaibun | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 25-3-2019 | +| GitHub username | wannaphongcom | +| Website (optional) | | diff --git a/examples/pipeline/dummy_entity_linking.py b/examples/pipeline/dummy_entity_linking.py new file mode 100644 index 000000000..88415d040 --- /dev/null +++ b/examples/pipeline/dummy_entity_linking.py @@ -0,0 +1,71 @@ +# coding: utf-8 +from __future__ import unicode_literals + +"""Demonstrate how to build a simple knowledge base and run an Entity Linking algorithm. +Currently still a bit of a dummy algorithm: it simply takes the entity with the highest prior probability for a given alias. +""" +import spacy +from spacy.kb import KnowledgeBase + + +def create_kb(vocab): + kb = KnowledgeBase(vocab=vocab) + + # adding entities + entity_0 = "Q1004791_Douglas" + print("adding entity", entity_0) + kb.add_entity(entity=entity_0, prob=0.5) + + entity_1 = "Q42_Douglas_Adams" + print("adding entity", entity_1) + kb.add_entity(entity=entity_1, prob=0.5) + + entity_2 = "Q5301561_Douglas_Haig" + print("adding entity", entity_2) + kb.add_entity(entity=entity_2, prob=0.5) + + # adding aliases + print() + alias_0 = "Douglas" + print("adding alias", alias_0) + kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.1, 0.6, 0.2]) + + alias_1 = "Douglas Adams" + print("adding alias", alias_1) + kb.add_alias(alias=alias_1, entities=[entity_1], probabilities=[0.9]) + + print() + print("kb size:", len(kb), kb.get_size_entities(), kb.get_size_aliases()) + + return kb + + +def add_el(kb, nlp): + el_pipe = nlp.create_pipe(name='entity_linker', config={"kb": kb}) + nlp.add_pipe(el_pipe, last=True) + + for alias in ["Douglas Adams", "Douglas"]: + candidates = nlp.linker.kb.get_candidates(alias) + print() + print(len(candidates), "candidate(s) for", alias, ":") + for c in candidates: + print(" ", c.entity_, c.prior_prob) + + text = "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, " \ + "Douglas reminds us to always bring our towel. " \ + "The main character in Doug's novel is called Arthur Dent."
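+    # Running the full pipeline applies NER first; the entity_linker pipe added
+    # above then tags each predicted entity with the KB candidate that has the
+    # highest prior probability (context is not taken into account yet).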
+ doc = nlp(text) + + print() + for token in doc: + print("token", token.text, token.ent_type_, token.ent_kb_id_) + + print() + for ent in doc.ents: + print("ent", ent.text, ent.label_, ent.kb_id_) + + +if __name__ == "__main__": + nlp = spacy.load('en_core_web_sm') + my_kb = create_kb(nlp.vocab) + add_el(my_kb, nlp) diff --git a/setup.py b/setup.py index ed030eaf0..23d535058 100755 --- a/setup.py +++ b/setup.py @@ -40,6 +40,7 @@ MOD_NAMES = [ "spacy.lexeme", "spacy.vocab", "spacy.attrs", + "spacy.kb", "spacy.morphology", "spacy.pipeline.pipes", "spacy.syntax.stateclass", diff --git a/spacy/errors.py b/spacy/errors.py index b63c46919..5f964114e 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -80,6 +80,8 @@ class Warnings(object): "the v2.x models cannot release the global interpreter lock. " "Future versions may introduce a `n_process` argument for " "parallel inference via multiprocessing.") + W017 = ("Alias '{alias}' already exists in the Knowledge base.") + W018 = ("Entity '{entity}' already exists in the Knowledge base.") @add_codes @@ -371,6 +373,16 @@ class Errors(object): "with spacy >= 2.1.0. To fix this, reinstall Python and use a wide " "unicode build instead. You can also rebuild Python and set the " "--enable-unicode=ucs4 flag.") + E131 = ("Cannot write the kb_id of an existing Span object because a Span " + "is a read-only view of the underlying Token objects stored in the Doc. " + "Instead, create a new Span object and specify the `kb_id` keyword argument, " + "for example:\nfrom spacy.tokens import Span\n" + "span = Span(doc, start={start}, end={end}, label='{label}', kb_id='{kb_id}')") + E132 = ("The vectors for entities and probabilities for alias '{alias}' should have equal length, " + "but found {entities_length} and {probabilities_length} respectively.") + E133 = ("The sum of prior probabilities for alias '{alias}' should not exceed 1, " + "but found {sum}.") + E134 = ("Alias '{alias}' defined for unknown entity '{entity}'.") @add_codes diff --git a/spacy/kb.pxd b/spacy/kb.pxd new file mode 100644 index 000000000..e34a0a9ba --- /dev/null +++ b/spacy/kb.pxd @@ -0,0 +1,148 @@ +"""Knowledge-base for entity or concept linking.""" +from cymem.cymem cimport Pool +from preshed.maps cimport PreshMap +from libcpp.vector cimport vector +from libc.stdint cimport int32_t, int64_t + +from spacy.vocab cimport Vocab +from .typedefs cimport hash_t + + +# Internal struct, for storage and disambiguation. This isn't what we return +# to the user as the answer to "here's your entity". It's the minimum number +# of bits we need to keep track of the answers. +cdef struct _EntryC: + + # The hash of this entry's unique ID and name in the KB + hash_t entity_hash + + # Allows retrieval of one or more vectors. + # Each element of vector_rows should be an index into a vectors table. + # Every entry should have the same number of vectors, so we can avoid storing + # the number of vectors in each knowledge-base struct. + int32_t* vector_rows + + # Allows retrieval of a struct of non-vector features. We could make this a + # pointer, but we have 32 bits left over in the struct after prob, so we'd + # like this to only be 32 bits. We can also set this to -1, for the common + # case where there are no features. + int32_t feats_row + + # log probability of entity, based on corpus frequency + float prob + + +# Each alias struct stores a list of Entry pointers with their prior probabilities +# for this specific mention/alias. 
+cdef struct _AliasC: + + # All entry candidates for this alias + vector[int64_t] entry_indices + + # Prior probability P(entity|alias) - should sum up to (at most) 1. + vector[float] probs + + +# Object used by the Entity Linker that summarizes one entity-alias candidate combination. +cdef class Candidate: + + cdef readonly KnowledgeBase kb + cdef hash_t entity_hash + cdef hash_t alias_hash + cdef float prior_prob + + +cdef class KnowledgeBase: + cdef Pool mem + cpdef readonly Vocab vocab + + # This maps 64bit keys (hash of unique entity string) + # to 64bit values (position of the _EntryC struct in the _entries vector). + # The PreshMap is pretty space efficient, as it uses open addressing. So + # the only overhead is the vacancy rate, which is approximately 30%. + cdef PreshMap _entry_index + + # Each entry takes 128 bits, and again we'll have a 30% or so overhead for + # over allocation. + # In total we end up with (N*128*1.3)+(N*128*1.3) bits for N entries. + # Storing 1m entries would take 41.6mb under this scheme. + cdef vector[_EntryC] _entries + + # This maps 64bit keys (hash of unique alias string) + # to 64bit values (position of the _AliasC struct in the _aliases_table vector). + cdef PreshMap _alias_index + + # This should map mention hashes to (entry_id, prob) tuples. The probability + # should be P(entity | mention), which is pretty important to know. + # We can pack both pieces of information into a 64-bit value, to keep things + # efficient. + cdef vector[_AliasC] _aliases_table + + # This is the part which might take more space: storing various + # categorical features for the entries, and storing vectors for disambiguation + # and possibly usage. + # If each entry gets a 300-dimensional vector, for 1m entries we would need + # 1.2gb. That gets expensive fast. What might be better is to avoid learning + # a unique vector for every entity. We could instead have a compositional + # model, that embeds different features of the entities into vectors. We'll + # still want some per-entity features, like the Wikipedia text or entity + # co-occurrence. Hopefully those vectors can be narrow, e.g. 64 dimensions. + cdef object _vectors_table + + # It's very useful to track categorical features, at least for output, even + # if they're not useful in the model itself. For instance, we should be + # able to track stuff like a person's date of birth or whatever. This can + # easily make the KB bigger, but if this isn't needed by the model, and it's + # optional data, we can let users configure a DB as the backend for this. + cdef object _features_table + + cdef inline int64_t c_add_entity(self, hash_t entity_hash, float prob, + int32_t* vector_rows, int feats_row): + """Add an entry to the knowledge base.""" + # This is what we'll map the hash key to. It's where the entry will sit + # in the vector of entries, so we can get it later. 
+ cdef int64_t new_index = self._entries.size() + self._entries.push_back( + _EntryC( + entity_hash=entity_hash, + vector_rows=vector_rows, + feats_row=feats_row, + prob=prob + )) + self._entry_index[entity_hash] = new_index + return new_index + + cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs): + """Connect a mention to a list of potential entities with their prior probabilities.""" + cdef int64_t new_index = self._aliases_table.size() + + self._aliases_table.push_back( + _AliasC( + entry_indices=entry_indices, + probs=probs + )) + self._alias_index[alias_hash] = new_index + return new_index + + cdef inline _create_empty_vectors(self): + """ + Make sure the first element of each vector is a dummy, + because the PreshMap maps pointing to indices in these vectors cannot contain 0 as a value, + cf. https://github.com/explosion/preshed/issues/17 + """ + cdef int32_t dummy_value = 0 + self.vocab.strings.add("") + self._entries.push_back( + _EntryC( + entity_hash=self.vocab.strings[""], + vector_rows=&dummy_value, + feats_row=dummy_value, + prob=dummy_value + )) + self._aliases_table.push_back( + _AliasC( + entry_indices=[dummy_value], + probs=[dummy_value] + )) + + diff --git a/spacy/kb.pyx b/spacy/kb.pyx new file mode 100644 index 000000000..3a0a8b918 --- /dev/null +++ b/spacy/kb.pyx @@ -0,0 +1,131 @@ +# cython: profile=True +# coding: utf8 +from spacy.errors import Errors, Warnings, user_warning + + +cdef class Candidate: + + def __init__(self, KnowledgeBase kb, entity_hash, alias_hash, prior_prob): + self.kb = kb + self.entity_hash = entity_hash + self.alias_hash = alias_hash + self.prior_prob = prior_prob + + @property + def entity(self): + """RETURNS (uint64): hash of the entity's KB ID/name""" + return self.entity_hash + + @property + def entity_(self): + """RETURNS (unicode): ID/name of this entity in the KB""" + return self.kb.vocab.strings[self.entity] + + @property + def alias(self): + """RETURNS (uint64): hash of the alias""" + return self.alias_hash + + @property + def alias_(self): + """RETURNS (unicode): ID of the original alias""" + return self.kb.vocab.strings[self.alias] + + @property + def prior_prob(self): + return self.prior_prob + + +cdef class KnowledgeBase: + + def __init__(self, Vocab vocab): + self.vocab = vocab + self._entry_index = PreshMap() + self._alias_index = PreshMap() + self.mem = Pool() + self._create_empty_vectors() + + def __len__(self): + return self.get_size_entities() + + def get_size_entities(self): + return self._entries.size() - 1 # not counting dummy element on index 0 + + def get_size_aliases(self): + return self._aliases_table.size() - 1 # not counting dummy element on index 0 + + def add_entity(self, unicode entity, float prob=0.5, vectors=None, features=None): + """ + Add an entity to the KB. + Return the hash of the entity ID at the end. + """ + cdef hash_t entity_hash = self.vocab.strings.add(entity) + + # Return if this entity was added before + if entity_hash in self._entry_index: + user_warning(Warnings.W018.format(entity=entity)) + return + + cdef int32_t dummy_value = 342 + self.c_add_entity(entity_hash=entity_hash, prob=prob, + vector_rows=&dummy_value, feats_row=dummy_value) + # TODO self._vectors_table.get_pointer(vectors), + # self._features_table.get(features)) + + return entity_hash + + def add_alias(self, unicode alias, entities, probabilities): + """ + For a given alias, add its potential entities and prior probabilities to the KB. 
+ Return the alias_hash at the end. + """ + + # Throw an error if the length of entities and probabilities are not the same + if len(entities) != len(probabilities): + raise ValueError(Errors.E132.format(alias=alias, + entities_length=len(entities), + probabilities_length=len(probabilities))) + + # Throw an error if the probabilities sum up to more than 1 + prob_sum = sum(probabilities) + if prob_sum > 1: + raise ValueError(Errors.E133.format(alias=alias, sum=prob_sum)) + + cdef hash_t alias_hash = self.vocab.strings.add(alias) + + # Return if this alias was added before + if alias_hash in self._alias_index: + user_warning(Warnings.W017.format(alias=alias)) + return + + cdef hash_t entity_hash + + cdef vector[int64_t] entry_indices + cdef vector[float] probs + + for entity, prob in zip(entities, probabilities): + entity_hash = self.vocab.strings[entity] + if entity_hash not in self._entry_index: + raise ValueError(Errors.E134.format(alias=alias, entity=entity)) + + entry_index = self._entry_index.get(entity_hash) + entry_indices.push_back(int(entry_index)) + probs.push_back(float(prob)) + + self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs) + + return alias_hash + + + def get_candidates(self, unicode alias): + """TODO: where to put this functionality?""" + cdef hash_t alias_hash = self.vocab.strings[alias] + alias_index = self._alias_index.get(alias_hash) + alias_entry = self._aliases_table[alias_index] + + return [Candidate(kb=self, + entity_hash=self._entries[entry_index].entity_hash, + alias_hash=alias_hash, + prior_prob=prob) + for (entry_index, prob) in zip(alias_entry.entry_indices, alias_entry.probs) + if entry_index != 0] diff --git a/spacy/lang/en/lemmatizer/_adverbs_irreg.py b/spacy/lang/en/lemmatizer/_adverbs_irreg.py index 4f0b479b8..4499b2a13 100644 --- a/spacy/lang/en/lemmatizer/_adverbs_irreg.py +++ b/spacy/lang/en/lemmatizer/_adverbs_irreg.py @@ -5,9 +5,27 @@ from __future__ import unicode_literals ADVERBS_IRREG = { "best": ("well",), "better": ("well",), + "closer": ("close",), + "closest": ("close",), "deeper": ("deeply",), + "earlier": ("early",), + "earliest": ("early",), "farther": ("far",), "further": ("far",), + "faster": ("fast",), + "fastest": ("fast",), "harder": ("hard",), "hardest": ("hard",), + "longer": ("long",), + "longest": ("long",), + "nearer": ("near",), + "nearest": ("near",), + "nigher": ("nigh",), + "nighest": ("nigh",), + "quicker": ("quick",), + "quickest": ("quick",), + "slower": ("slow",), + "slowest": ("slow",), + "sooner": ("soon",), + "soonest": ("soon",) } diff --git a/spacy/lang/th/tag_map.py b/spacy/lang/th/tag_map.py index 873ba65fc..6515ffe05 100644 --- a/spacy/lang/th/tag_map.py +++ b/spacy/lang/th/tag_map.py @@ -1,7 +1,7 @@ # encoding: utf8 from __future__ import unicode_literals -from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX +from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX, VERB from ...symbols import ADP, CCONJ, PART, PUNCT, SPACE, SCONJ # Source: Korakot Chaovavanich @@ -16,6 +16,9 @@ TAG_MAP = { "CMTR": {POS: NOUN}, "CFQC": {POS: NOUN}, "CVBL": {POS: NOUN}, + # VERB + "VACT": {POS: VERB}, + "VSTA": {POS: VERB}, # PRON "PRON": {POS: PRON}, "NPRP": {POS: PRON}, @@ -78,6 +81,7 @@ "EAFF": {POS: PART}, "AITT": {POS: PART}, "NEG": {POS: PART}, + "EITT": {POS: PART}, # PUNCT "PUNCT": {POS: PUNCT}, "PUNC": {POS: PUNCT}, diff --git a/spacy/language.py b/spacy/language.py index c1642f631..6bd21b0bc 100644 --- a/spacy/language.py +++ 
b/spacy/language.py @@ -14,7 +14,7 @@ import srsly from .tokenizer import Tokenizer from .vocab import Vocab from .lemmatizer import Lemmatizer -from .pipeline import DependencyParser, Tensorizer, Tagger, EntityRecognizer +from .pipeline import DependencyParser, Tensorizer, Tagger, EntityRecognizer, EntityLinker from .pipeline import SimilarityHook, TextCategorizer, Sentencizer from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens from .pipeline import EntityRuler @@ -117,6 +117,7 @@ class Language(object): "tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg), "parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg), "ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg), + "entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg), "similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg), "textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg), "sentencizer": lambda nlp, **cfg: Sentencizer(**cfg), @@ -212,6 +213,10 @@ class Language(object): def entity(self): return self.get_pipe("ner") + @property + def linker(self): + return self.get_pipe("entity_linker") + @property def matcher(self): return self.get_pipe("matcher") diff --git a/spacy/pipeline/__init__.py b/spacy/pipeline/__init__.py index eaadd977d..5d7b079d9 100644 --- a/spacy/pipeline/__init__.py +++ b/spacy/pipeline/__init__.py @@ -1,7 +1,7 @@ # coding: utf8 from __future__ import unicode_literals -from .pipes import Tagger, DependencyParser, EntityRecognizer +from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker from .pipes import TextCategorizer, Tensorizer, Pipe, Sentencizer from .entityruler import EntityRuler from .hooks import SentenceSegmenter, SimilarityHook @@ -11,6 +11,7 @@ __all__ = [ "Tagger", "DependencyParser", "EntityRecognizer", + "EntityLinker", "TextCategorizer", "Tensorizer", "Pipe", diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index 5e94c2f95..7043c1647 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -1061,6 +1061,55 @@ cdef class EntityRecognizer(Parser): if move[0] in ("B", "I", "L", "U"))) +class EntityLinker(Pipe): + name = 'entity_linker' + + @classmethod + def Model(cls, nr_class=1, **cfg): + # TODO: non-dummy EL implementation + return None + + def __init__(self, model=True, **cfg): + self.model = False + self.cfg = dict(cfg) + self.kb = self.cfg["kb"] + + def __call__(self, doc): + self.set_annotations([doc], scores=None, tensors=None) + return doc + + def pipe(self, stream, batch_size=128, n_threads=-1): + """Apply the pipe to a stream of documents. + Both __call__ and pipe should delegate to the `predict()` + and `set_annotations()` methods. + """ + for docs in util.minibatch(stream, size=batch_size): + docs = list(docs) + self.set_annotations(docs, scores=None, tensors=None) + yield from docs + + def set_annotations(self, docs, scores, tensors=None): + """ + Currently implemented as taking the KB entry with highest prior probability for each named entity + TODO: actually use context etc + """ + for i, doc in enumerate(docs): + for ent in doc.ents: + candidates = self.kb.get_candidates(ent.text) + if candidates: + best_candidate = max(candidates, key=lambda c: c.prior_prob) + for token in ent: + token.ent_kb_id_ = best_candidate.entity_ + + def get_loss(self, docs, golds, scores): + # TODO + pass + + def add_label(self, label): + # TODO + pass + + class Sentencizer(object): """Segment the Doc into sentences using a rule-based strategy. 
@@ -1146,5 +1195,5 @@ class Sentencizer(object): self.punct_chars = cfg.get("punct_chars", self.default_punct_chars) return self - -__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "Sentencizer"] + +__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"] diff --git a/spacy/structs.pxd b/spacy/structs.pxd index fa282cae7..154202c0d 100644 --- a/spacy/structs.pxd +++ b/spacy/structs.pxd @@ -70,4 +70,5 @@ cdef struct TokenC: int sent_start int ent_iob attr_t ent_type # TODO: Is there a better way to do this? Multiple sources of truth.. + attr_t ent_kb_id hash_t ent_id diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py index 087006f26..13f7f2771 100644 --- a/spacy/tests/doc/test_span.py +++ b/spacy/tests/doc/test_span.py @@ -172,10 +172,12 @@ def test_span_as_doc(doc): assert span_doc[0].idx == 0 -def test_span_string_label(doc): - span = Span(doc, 0, 1, label="hello") +def test_span_string_label_kb_id(doc): + span = Span(doc, 0, 1, label="hello", kb_id="Q342") assert span.label_ == "hello" assert span.label == doc.vocab.strings["hello"] + assert span.kb_id_ == "Q342" + assert span.kb_id == doc.vocab.strings["Q342"] def test_span_label_readonly(doc): @@ -184,6 +186,12 @@ def test_span_label_readonly(doc): span.label_ = "hello" +def test_span_kb_id_readonly(doc): + span = Span(doc, 0, 1) + with pytest.raises(NotImplementedError): + span.kb_id_ = "Q342" + + def test_span_ents_property(doc): """Test span.ents for the """ doc.ents = [ diff --git a/spacy/tests/lang/en/test_exceptions.py b/spacy/tests/lang/en/test_exceptions.py index 6285a9408..b360b517e 100644 --- a/spacy/tests/lang/en/test_exceptions.py +++ b/spacy/tests/lang/en/test_exceptions.py @@ -124,3 +124,9 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms): def test_en_lex_attrs_norm_exceptions(en_tokenizer, text, norm): tokens = en_tokenizer(text) assert tokens[0].norm_ == norm + + +@pytest.mark.parametrize("text", ["faster", "fastest", "better", "best"]) +def test_en_lemmatizer_handles_irreg_adverbs(en_tokenizer, text): + tokens = en_tokenizer(text) + assert tokens[0].lemma_ in ["fast", "well"] diff --git a/spacy/tests/lang/ja/test_tokenizer.py b/spacy/tests/lang/ja/test_tokenizer.py index 87a343185..c95e7bc40 100644 --- a/spacy/tests/lang/ja/test_tokenizer.py +++ b/spacy/tests/lang/ja/test_tokenizer.py @@ -14,11 +14,11 @@ TOKENIZER_TESTS = [ ] TAG_TESTS = [ - ("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']), - ("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']), - ("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']), - ("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点 ']), - ("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能']) + ("日本語だよ", ['名詞,固有名詞,地名,国', '名詞,普通名詞,一般,*', '助動詞,*,*,*', '助詞,終助詞,*,*']), + ("東京タワーの近くに住んでいます。", ['名詞,固有名詞,地名,一般', '名詞,普通名詞,一般,*', '助詞,格助詞,*,*', '名詞,普通名詞,副詞可能,*', '助詞,格助詞,*,*', '動詞,一般,*,*', '助詞,接続助詞,*,*', '動詞,非自立可能,*,*', '助動詞,*,*,*', '補助記号,句点,*,*']), + ("吾輩は猫である。", ['代名詞,*,*,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助動詞,*,*,*', '動詞,非自立可能,*,*', '補助記号,句点,*,*']), + ("月に代わって、お仕置きよ!", ['名詞,普通名詞,助数詞可能,*', '助詞,格助詞,*,*', '動詞,一般,*,*', '助詞,接続助詞,*,*', '補助記号,読点,*,*', '接頭辞,*,*,*', '名詞,普通名詞,一般,*', '助詞,終助詞,*,*', '補助記号,句点,*,*']), + 
("すもももももももものうち", ['名詞,普通名詞,一般,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助詞,格助詞,*,*', '名詞,普通名詞,副詞可能,*']) ] POS_TESTS = [ diff --git a/spacy/tests/pipeline/test_el.py b/spacy/tests/pipeline/test_el.py new file mode 100644 index 000000000..61baece68 --- /dev/null +++ b/spacy/tests/pipeline/test_el.py @@ -0,0 +1,91 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + +from spacy.kb import KnowledgeBase +from spacy.lang.en import English + + +@pytest.fixture +def nlp(): + return English() + + +def test_kb_valid_entities(nlp): + """Test the valid construction of a KB with 3 entities and two aliases""" + mykb = KnowledgeBase(nlp.vocab) + + # adding entities + mykb.add_entity(entity=u'Q1', prob=0.9) + mykb.add_entity(entity=u'Q2') + mykb.add_entity(entity=u'Q3', prob=0.5) + + # adding aliases + mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2]) + mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9]) + + # test the size of the corresponding KB + assert(mykb.get_size_entities() == 3) + assert(mykb.get_size_aliases() == 2) + + +def test_kb_invalid_entities(nlp): + """Test the invalid construction of a KB with an alias linked to a non-existing entity""" + mykb = KnowledgeBase(nlp.vocab) + + # adding entities + mykb.add_entity(entity=u'Q1', prob=0.9) + mykb.add_entity(entity=u'Q2', prob=0.2) + mykb.add_entity(entity=u'Q3', prob=0.5) + + # adding aliases - should fail because one of the given IDs is not valid + with pytest.raises(ValueError): + mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q342'], probabilities=[0.8, 0.2]) + + +def test_kb_invalid_probabilities(nlp): + """Test the invalid construction of a KB with wrong prior probabilities""" + mykb = KnowledgeBase(nlp.vocab) + + # adding entities + mykb.add_entity(entity=u'Q1', prob=0.9) + mykb.add_entity(entity=u'Q2', prob=0.2) + mykb.add_entity(entity=u'Q3', prob=0.5) + + # adding aliases - should fail because the sum of the probabilities exceeds 1 + with pytest.raises(ValueError): + mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.4]) + + +def test_kb_invalid_combination(nlp): + """Test the invalid construction of a KB with non-matching entity and probability lists""" + mykb = KnowledgeBase(nlp.vocab) + + # adding entities + mykb.add_entity(entity=u'Q1', prob=0.9) + mykb.add_entity(entity=u'Q2', prob=0.2) + mykb.add_entity(entity=u'Q3', prob=0.5) + + # adding aliases - should fail because the entities and probabilities vectors are not of equal length + with pytest.raises(ValueError): + mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.3, 0.4, 0.1]) + + +def test_candidate_generation(nlp): + """Test correct candidate generation""" + mykb = KnowledgeBase(nlp.vocab) + + # adding entities + mykb.add_entity(entity=u'Q1', prob=0.9) + mykb.add_entity(entity=u'Q2', prob=0.2) + mykb.add_entity(entity=u'Q3', prob=0.5) + + # adding aliases + mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2]) + mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9]) + + # test the size of the relevant candidates + assert(len(mykb.get_candidates(u'douglas')) == 2) + assert(len(mykb.get_candidates(u'adam')) == 1) + assert(len(mykb.get_candidates(u'shrubbery')) == 0) diff --git a/spacy/tests/regression/test_issue3447.py b/spacy/tests/regression/test_issue3447.py new file mode 100644 index 000000000..bfe71669a --- /dev/null +++ b/spacy/tests/regression/test_issue3447.py @@ -0,0 
+1,10 @@ +from spacy.util import decaying + +def test_decaying(): + sizes = decaying(10., 1., .5) + size = next(sizes) + assert size == 10. + size = next(sizes) + assert size == 10. - 0.5 + size = next(sizes) + assert size == 10. - 0.5 - 0.5 diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index e390a72b9..70a693ba1 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -131,6 +131,7 @@ cdef class Tokenizer: texts: A sequence of unicode texts. batch_size (int): Number of texts to accumulate in an internal buffer. + Defaults to 1000. YIELDS (Doc): A sequence of Doc objects, in order. DOCS: https://spacy.io/api/tokenizer#pipe diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index e433002f2..131c43d37 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -326,7 +326,7 @@ cdef class Doc: def doc(self): return self - def char_span(self, int start_idx, int end_idx, label=0, vector=None): + def char_span(self, int start_idx, int end_idx, label=0, kb_id=0, vector=None): """Create a `Span` object from the slice `doc.text[start : end]`. doc (Doc): The parent document. @@ -334,6 +334,7 @@ cdef class Doc: end (int): The index of the first character after the span. label (uint64 or string): A label to attach to the Span, e.g. for named entities. + kb_id (uint64 or string): An ID from a KB to capture the meaning of a named entity. vector (ndarray[ndim=1, dtype='float32']): A meaning representation of the span. RETURNS (Span): The newly constructed object. @@ -342,6 +343,8 @@ cdef class Doc: """ if not isinstance(label, int): label = self.vocab.strings.add(label) + if not isinstance(kb_id, int): + kb_id = self.vocab.strings.add(kb_id) cdef int start = token_by_start(self.c, self.length, start_idx) if start == -1: return None @@ -350,7 +353,7 @@ cdef class Doc: return None # Currently we have the token index, we want the range-end index end += 1 - cdef Span span = Span(self, start, end, label=label, vector=vector) + cdef Span span = Span(self, start, end, label=label, kb_id=kb_id, vector=vector) return span def similarity(self, other): @@ -484,6 +487,7 @@ cdef class Doc: cdef const TokenC* token cdef int start = -1 cdef attr_t label = 0 + cdef attr_t kb_id = 0 output = [] for i in range(self.length): token = &self.c[i] @@ -493,16 +497,18 @@ cdef class Doc: raise ValueError(Errors.E093.format(seq=" ".join(seq))) elif token.ent_iob == 2 or token.ent_iob == 0: if start != -1: - output.append(Span(self, start, i, label=label)) + output.append(Span(self, start, i, label=label, kb_id=kb_id)) start = -1 label = 0 + kb_id = 0 elif token.ent_iob == 3: if start != -1: - output.append(Span(self, start, i, label=label)) + output.append(Span(self, start, i, label=label, kb_id=kb_id)) start = i label = token.ent_type + kb_id = token.ent_kb_id if start != -1: - output.append(Span(self, start, self.length, label=label)) + output.append(Span(self, start, self.length, label=label, kb_id=kb_id)) return tuple(output) def __set__(self, ents): diff --git a/spacy/tokens/span.pxd b/spacy/tokens/span.pxd index 9645189a5..f6f88a23e 100644 --- a/spacy/tokens/span.pxd +++ b/spacy/tokens/span.pxd @@ -11,6 +11,7 @@ cdef class Span: cdef readonly int start_char cdef readonly int end_char cdef readonly attr_t label + cdef readonly attr_t kb_id cdef public _vector cdef public _vector_norm diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx index e62caed40..97b6a1adc 100644 --- a/spacy/tokens/span.pyx +++ b/spacy/tokens/span.pyx @@ -85,13 +85,14 @@ cdef class Span: return 
Underscore.span_extensions.pop(name) def __cinit__(self, Doc doc, int start, int end, label=0, vector=None, - vector_norm=None): + vector_norm=None, kb_id=0): """Create a `Span` object from the slice `doc[start : end]`. doc (Doc): The parent document. start (int): The index of the first token of the span. end (int): The index of the first token after the span. label (uint64): A label to attach to the Span, e.g. for named entities. + kb_id (uint64): An identifier from a Knowledge Base to capture the meaning of a named entity. vector (ndarray[ndim=1, dtype='float32']): A meaning representation of the span. RETURNS (Span): The newly constructed object. @@ -110,11 +111,14 @@ cdef class Span: self.end_char = 0 if isinstance(label, basestring_): label = doc.vocab.strings.add(label) + if isinstance(kb_id, basestring_): + kb_id = doc.vocab.strings.add(kb_id) if label not in doc.vocab.strings: raise ValueError(Errors.E084.format(label=label)) self.label = label self._vector = vector self._vector_norm = vector_norm + self.kb_id = kb_id def __richcmp__(self, Span other, int op): if other is None: @@ -655,6 +659,20 @@ cdef class Span: label_ = '' raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_)) + property kb_id_: + """RETURNS (unicode): The named entity's KB ID.""" + def __get__(self): + return self.doc.vocab.strings[self.kb_id] + + def __set__(self, unicode kb_id_): + if not kb_id_: + kb_id_ = '' + current_label = self.label_ + if not current_label: + current_label = '' + raise NotImplementedError(Errors.E131.format(start=self.start, end=self.end, + label=current_label, kb_id=kb_id_)) + cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1: # Don't allow spaces to be the root, if there are diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index bdf6a8dd5..eb79de16b 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -770,6 +770,22 @@ cdef class Token: def __set__(self, name): self.c.ent_id = self.vocab.strings.add(name) + property ent_kb_id: + """RETURNS (uint64): Named entity KB ID.""" + def __get__(self): + return self.c.ent_kb_id + + def __set__(self, attr_t ent_kb_id): + self.c.ent_kb_id = ent_kb_id + + property ent_kb_id_: + """RETURNS (unicode): Named entity KB ID.""" + def __get__(self): + return self.vocab.strings[self.c.ent_kb_id] + + def __set__(self, ent_kb_id): + self.c.ent_kb_id = self.vocab.strings.add(ent_kb_id) + @property def whitespace_(self): """RETURNS (unicode): The trailing whitespace character, if present.""" diff --git a/spacy/util.py b/spacy/util.py index 7a36fe958..1cea8b6ca 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -507,13 +507,10 @@ def stepping(start, stop, steps): def decaying(start, stop, decay): """Yield an infinite series of linearly decaying values.""" - def clip(value): - return max(value, stop) if (start > stop) else min(value, stop) - - nr_upd = 1.0 + curr = float(start) while True: - yield clip(start * 1.0 / (1.0 + decay * nr_upd)) - nr_upd += 1 + yield max(curr, stop) + curr -= (decay) def minibatch_by_words(items, size, tuples=True, count_words=len): diff --git a/website/docs/api/tokenizer.md b/website/docs/api/tokenizer.md index 50f4fceae..3bec5b165 100644 --- a/website/docs/api/tokenizer.md +++ b/website/docs/api/tokenizer.md @@ -64,7 +64,7 @@ Tokenize a stream of texts. | Name | Type | Description | | ------------ | ----- | -------------------------------------------------------- | | `texts` | - | A sequence of unicode texts. 
| -| `batch_size` | int | The number of texts to accumulate in an internal buffer. | +| `batch_size` | int | The number of texts to accumulate in an internal buffer. Defaults to `1000`. | | **YIELDS** | `Doc` | A sequence of Doc objects, in order. | ## Tokenizer.find_infix {#find_infix tag="method"} diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index ff125d2eb..57af729f0 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -622,10 +622,10 @@ Yield an infinite series of linearly decaying values. > #### Example > > ```python -> sizes = decaying(1., 10., 0.001) -> assert next(sizes) == 1. -> assert next(sizes) == 1. - 0.001 -> assert next(sizes) == 0.999 - 0.001 +> sizes = decaying(10., 1., 0.001) +> assert next(sizes) == 10. +> assert next(sizes) == 10. - 0.001 +> assert next(sizes) == 9.999 - 0.001 > ``` | Name | Type | Description | diff --git a/website/gatsby-node.js b/website/gatsby-node.js index ddc060b01..4aaf5f45e 100644 --- a/website/gatsby-node.js +++ b/website/gatsby-node.js @@ -40,6 +40,7 @@ exports.createPages = ({ graphql, actions }) => { resources { id title + slogan } categories { label @@ -178,6 +179,7 @@ exports.createPages = ({ graphql, actions }) => { slug: slug, isIndex: false, title: page.title || page.id, + teaser: page.slogan, data: { ...page, isProject: true }, ...universeContext, }, diff --git a/website/meta/languages.json b/website/meta/languages.json index 600500c7a..defb08037 100644 --- a/website/meta/languages.json +++ b/website/meta/languages.json @@ -117,6 +117,7 @@ { "code": "sk", "name": "Slovak" }, { "code": "sl", "name": "Slovenian" }, { "code": "sq", "name": "Albanian" }, + { "code": "et", "name": "Estonian" }, { "code": "th", "name": "Thai", diff --git a/website/meta/site.json b/website/meta/site.json index 4323d7991..1820ff5df 100644 --- a/website/meta/site.json +++ b/website/meta/site.json @@ -29,7 +29,7 @@ "spacyVersion": "2.1", "binderUrl": "ines/spacy-io-binder", "binderBranch": "live", - "binderVersion": "2.1.2", + "binderVersion": "2.1.3", "sections": [ { "id": "usage", "title": "Usage Documentation", "theme": "blue" }, { "id": "models", "title": "Models Documentation", "theme": "blue" }, diff --git a/website/meta/universe.json b/website/meta/universe.json index e81654a07..94d3e9319 100644 --- a/website/meta/universe.json +++ b/website/meta/universe.json @@ -554,6 +554,36 @@ }, "category": ["standalone"] }, + { + "id": "textpipe", + "slogan": "clean and extract metadata from text", + "description": "`textpipe` is a Python package for converting raw text into clean, readable text and extracting metadata from that text. It removes markup such as HTML tags and extracts metadata such as the number of words and named entities from the text.", + "github": "textpipe/textpipe", + "pip": "textpipe", + "author": "Textpipe Contributors", + "author_links": { + "github": "textpipe", + "website": "https://github.com/textpipe/textpipe/blob/master/CONTRIBUTORS.md" + }, + "category": ["standalone"], + "tags": ["text-processing", "named-entity-recognition"], + "thumb": "https://avatars0.githubusercontent.com/u/40492530", + "code_example": [ + "from textpipe import doc, pipeline", + "sample_text = 'Sample text! 
'", + "document = doc.Doc(sample_text)", + "print(document.clean)", + "# 'Sample text!'", + "print(document.language)", + "# 'en'", + "print(document.nwords)", + "# 2", + "", + "pipe = pipeline.Pipeline(['CleanText', 'NWords'])", + "print(pipe(sample_text))", + "# {'CleanText': 'Sample text!', 'NWords': 2}" + ] + }, + { + "id": "mordecai", + "slogan": "Full text geoparsing using spaCy, Geonames and Keras", diff --git a/website/src/components/landing.js b/website/src/components/landing.js index e84534820..16c342e3f 100644 --- a/website/src/components/landing.js +++ b/website/src/components/landing.js @@ -75,14 +75,28 @@ export const LandingBannerGrid = ({ children }) => ( ) -export const LandingBanner = ({ title, label, to, button, small, background, color, children }) => { +export const LandingBanner = ({ + title, + label, + to, + button, + small, + background, + backgroundImage, + color, + children, +}) => { const contentClassNames = classNames(classes.bannerContent, { [classes.bannerContentSmall]: small, }) const textClassNames = classNames(classes.bannerText, { [classes.bannerTextSmall]: small, }) - const style = { '--color-theme': background, '--color-back': color } + const style = { + '--color-theme': background, + '--color-back': color, + backgroundImage: backgroundImage ? `url(${backgroundImage})` : null, + } const Heading = small ? H2 : H1 return (
@@ -113,7 +127,7 @@ export const LandingBanner = ({ title, label, to, button, small, background, col export const LandingBannerButton = ({ to, small, children }) => (
-
diff --git a/website/src/images/icon.png b/website/src/images/icon.png index 344996d90..5cfc8abe1 100644 Binary files a/website/src/images/icon.png and b/website/src/images/icon.png differ diff --git a/website/src/images/spacy-irl.jpg b/website/src/images/spacy-irl.jpg new file mode 100644 index 000000000..ee8f4bdc9 Binary files /dev/null and b/website/src/images/spacy-irl.jpg differ diff --git a/website/src/styles/landing.module.sass b/website/src/styles/landing.module.sass index efe3d3e5a..d7340229b 100644 --- a/website/src/styles/landing.module.sass +++ b/website/src/styles/landing.module.sass @@ -73,6 +73,7 @@ color: var(--color-back) padding: 5rem margin-bottom: var(--spacing-md) + background-size: cover .banner-content margin-bottom: 0 @@ -100,7 +101,7 @@ .banner-text-small p font-size: 1.35rem - margin-bottom: 1rem + margin-bottom: 1.5rem @include breakpoint(min, md) .banner-content @@ -134,6 +135,9 @@ margin-bottom: var(--spacing-sm) text-align: right +.banner-button-element + background: var(--color-theme) + .logos text-align: center padding-bottom: 1rem diff --git a/website/src/widgets/landing.js b/website/src/widgets/landing.js index 6905d46d0..9e6e95c2d 100644 --- a/website/src/widgets/landing.js +++ b/website/src/widgets/landing.js @@ -19,6 +19,7 @@ import { H2 } from '../components/typography' import { Ul, Li } from '../components/list' import Button from '../components/button' import Link from '../components/link' +import irlBackground from '../images/spacy-irl.jpg' import BenchmarksChoi from 'usage/_benchmarks-choi.md' @@ -151,19 +152,21 @@ const Landing = ({ data }) => { - Learn more from small training corpora by initializing your models with{' '} - knowledge from raw text. The new pretrain command teaches - spaCy's CNN model to predict words based on their context, producing - representations of words in contexts. If you've seen Google's BERT system or - fast.ai's ULMFiT, spaCy's pretraining is similar – but much more efficient. It's - still experimental, but users are already reporting good results, so give it a - try! + We're pleased to invite the spaCy community and other folks working on Natural + Language Processing to Berlin this summer for a small and intimate event{' '} + July 5-6, 2019. The event includes a hands-on training day for + teams using spaCy in production, followed by a one-track conference. We booked a + beautiful venue, hand-picked an awesome lineup of speakers and scheduled plenty + of social time to get to know each other and exchange ideas. { - spaCy v2.0 features new neural models for tagging,{' '} - parsing and entity recognition. The models have - been designed and implemented from scratch specifically for spaCy, to give you an - unmatched balance of speed, size and accuracy. A novel bloom embedding strategy with - subword features is used to support huge vocabularies in tiny tables. Convolutional - layers with residual connections, layer normalization and maxout non-linearity are - used, giving much better efficiency than the standard BiLSTM solution. Finally, the - parser and NER use an imitation learning objective to deliver accuracy in-line with - the latest research systems, even when evaluated from raw text. With these - innovations, spaCy v2.0's models are 10× smaller,{' '} - 20% more accurate, and - even cheaper to run than the previous generation. + Learn more from small training corpora by initializing your models with{' '} + knowledge from raw text. 
The new pretrain command teaches spaCy's + CNN model to predict words based on their context, producing representations of + words in contexts. If you've seen Google's BERT system or fast.ai's ULMFiT, spaCy's + pretraining is similar – but much more efficient. It's still experimental, but users + are already reporting good results, so give it a try!
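Taken together, the changes above sketch out a first, experimental entity-linking API: a `KnowledgeBase` that stores entities and aliases with prior probabilities, candidate lookup via `get_candidates()`, and KB IDs exposed on spans and tokens. The snippet below is a minimal sketch of how these pieces fit once the diff is applied; it assumes the `en_core_web_sm` model is installed, and the entity IDs, alias and probabilities are purely illustrative:

```python
# coding: utf-8
from __future__ import unicode_literals

import spacy
from spacy.kb import KnowledgeBase
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# Build a tiny KB: entities carry a prior (corpus-frequency) probability,
# and an alias maps a surface form to its candidate entities. The priors
# for one alias may sum to at most 1 (otherwise E133 is raised).
kb = KnowledgeBase(nlp.vocab)
kb.add_entity(entity="Q42", prob=0.9)
kb.add_entity(entity="Q5301561", prob=0.1)
kb.add_alias(alias="Douglas", entities=["Q42", "Q5301561"], probabilities=[0.8, 0.1])

# Candidate objects expose both the hashes and the resolved strings.
for candidate in kb.get_candidates("Douglas"):
    print(candidate.entity_, candidate.alias_, candidate.prior_prob)

# Spans can now carry a KB ID alongside the entity label ...
doc = nlp("Douglas reminds us to always bring our towel.")
span = Span(doc, 0, 1, label="PERSON", kb_id="Q42")
print(span.kb_id_)  # "Q42"

# ... and the same ID is readable and writable on each token.
doc[0].ent_kb_id_ = "Q42"
print(doc[0].ent_kb_id_)  # "Q42"
```

Note that the `EntityLinker` pipe itself is still a placeholder: as the TODOs in `pipes.pyx` indicate, it ignores context and simply picks the candidate with the highest prior probability.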