diff --git a/.github/contributors/cedar101.md b/.github/contributors/cedar101.md new file mode 100644 index 000000000..4d04ebacf --- /dev/null +++ b/.github/contributors/cedar101.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------ | +| Name | Kim, Baeg-il | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2019-07-03 | +| GitHub username | cedar101 | +| Website (optional) | | diff --git a/.github/contributors/yashpatadia.md b/.github/contributors/yashpatadia.md new file mode 100644 index 000000000..2dcf9211d --- /dev/null +++ b/.github/contributors/yashpatadia.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Yash Patadia | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 11/07/2019 | +| GitHub username | yash1994 | +| Website (optional) | | \ No newline at end of file diff --git a/.gitignore b/.gitignore index ef586ac8d..35d431d48 100644 --- a/.gitignore +++ b/.gitignore @@ -56,6 +56,8 @@ parts/ sdist/ var/ *.egg-info/ +pip-wheel-metadata/ +Pipfile.lock .installed.cfg *.egg .eggs diff --git a/bin/__init__.py b/bin/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/bin/train_word_vectors.py b/bin/train_word_vectors.py index 624e339a0..663ce060d 100644 --- a/bin/train_word_vectors.py +++ b/bin/train_word_vectors.py @@ -5,7 +5,6 @@ import logging from pathlib import Path from collections import defaultdict from gensim.models import Word2Vec -from preshed.counter import PreshCounter import plac import spacy diff --git a/bin/ud/conll17_ud_eval.py b/bin/ud/conll17_ud_eval.py index 78a976a6d..88acfabac 100644 --- a/bin/ud/conll17_ud_eval.py +++ b/bin/ud/conll17_ud_eval.py @@ -292,8 +292,8 @@ def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True): def spans_score(gold_spans, system_spans): correct, gi, si = 0, 0, 0 - undersegmented = list() - oversegmented = list() + undersegmented = [] + oversegmented = [] combo = 0 previous_end_si_earlier = False previous_end_gi_earlier = False diff --git a/bin/wiki_entity_linking/__init__.py b/bin/wiki_entity_linking/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/bin/wiki_entity_linking/kb_creator.py b/bin/wiki_entity_linking/kb_creator.py new file mode 100644 index 000000000..e8e081cef --- /dev/null +++ b/bin/wiki_entity_linking/kb_creator.py @@ -0,0 +1,171 @@ +# coding: utf-8 +from __future__ import unicode_literals + +from .train_descriptions import EntityEncoder +from . 
import wikidata_processor as wd, wikipedia_processor as wp +from spacy.kb import KnowledgeBase + +import csv +import datetime + + +INPUT_DIM = 300 # dimension of pre-trained input vectors +DESC_WIDTH = 64 # dimension of output entity vectors + + +def create_kb(nlp, max_entities_per_alias, min_entity_freq, min_occ, + entity_def_output, entity_descr_output, + count_input, prior_prob_input, wikidata_input): + # Create the knowledge base from Wikidata entries + kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=DESC_WIDTH) + + # disable this part of the pipeline when rerunning the KB generation from preprocessed files + read_raw_data = True + + if read_raw_data: + print() + print(" * _read_wikidata_entities", datetime.datetime.now()) + title_to_id, id_to_descr = wd.read_wikidata_entities_json(wikidata_input) + + # write the title-ID and ID-description mappings to file + _write_entity_files(entity_def_output, entity_descr_output, title_to_id, id_to_descr) + + else: + # read the mappings from file + title_to_id = get_entity_to_id(entity_def_output) + id_to_descr = get_id_to_description(entity_descr_output) + + print() + print(" * _get_entity_frequencies", datetime.datetime.now()) + print() + entity_frequencies = wp.get_all_frequencies(count_input=count_input) + + # filter the entities for in the KB by frequency, because there's just too much data (8M entities) otherwise + filtered_title_to_id = dict() + entity_list = [] + description_list = [] + frequency_list = [] + for title, entity in title_to_id.items(): + freq = entity_frequencies.get(title, 0) + desc = id_to_descr.get(entity, None) + if desc and freq > min_entity_freq: + entity_list.append(entity) + description_list.append(desc) + frequency_list.append(freq) + filtered_title_to_id[title] = entity + + print("Kept", len(filtered_title_to_id.keys()), "out of", len(title_to_id.keys()), + "titles with filter frequency", min_entity_freq) + + print() + print(" * train entity encoder", datetime.datetime.now()) + print() + encoder = EntityEncoder(nlp, INPUT_DIM, DESC_WIDTH) + encoder.train(description_list=description_list, to_print=True) + + print() + print(" * get entity embeddings", datetime.datetime.now()) + print() + embeddings = encoder.apply_encoder(description_list) + + print() + print(" * adding", len(entity_list), "entities", datetime.datetime.now()) + kb.set_entities(entity_list=entity_list, prob_list=frequency_list, vector_list=embeddings) + + print() + print(" * adding aliases", datetime.datetime.now()) + print() + _add_aliases(kb, title_to_id=filtered_title_to_id, + max_entities_per_alias=max_entities_per_alias, min_occ=min_occ, + prior_prob_input=prior_prob_input) + + print() + print("kb size:", len(kb), kb.get_size_entities(), kb.get_size_aliases()) + + print("done with kb", datetime.datetime.now()) + return kb + + +def _write_entity_files(entity_def_output, entity_descr_output, title_to_id, id_to_descr): + with open(entity_def_output, mode='w', encoding='utf8') as id_file: + id_file.write("WP_title" + "|" + "WD_id" + "\n") + for title, qid in title_to_id.items(): + id_file.write(title + "|" + str(qid) + "\n") + + with open(entity_descr_output, mode='w', encoding='utf8') as descr_file: + descr_file.write("WD_id" + "|" + "description" + "\n") + for qid, descr in id_to_descr.items(): + descr_file.write(str(qid) + "|" + descr + "\n") + + +def get_entity_to_id(entity_def_output): + entity_to_id = dict() + with open(entity_def_output, 'r', encoding='utf8') as csvfile: + csvreader = csv.reader(csvfile, delimiter='|') + # skip header 
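+        # remaining rows are pipe-delimited: WP_title|WD_id (as written by _write_entity_files)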
+ next(csvreader) + for row in csvreader: + entity_to_id[row[0]] = row[1] + return entity_to_id + + +def get_id_to_description(entity_descr_output): + id_to_desc = dict() + with open(entity_descr_output, 'r', encoding='utf8') as csvfile: + csvreader = csv.reader(csvfile, delimiter='|') + # skip header + next(csvreader) + for row in csvreader: + id_to_desc[row[0]] = row[1] + return id_to_desc + + +def _add_aliases(kb, title_to_id, max_entities_per_alias, min_occ, prior_prob_input): + wp_titles = title_to_id.keys() + + # adding aliases with prior probabilities + # we can read this file sequentially, it's sorted by alias, and then by count + with open(prior_prob_input, mode='r', encoding='utf8') as prior_file: + # skip header + prior_file.readline() + line = prior_file.readline() + previous_alias = None + total_count = 0 + counts = [] + entities = [] + while line: + splits = line.replace('\n', "").split(sep='|') + new_alias = splits[0] + count = int(splits[1]) + entity = splits[2] + + if new_alias != previous_alias and previous_alias: + # done reading the previous alias --> output + if len(entities) > 0: + selected_entities = [] + prior_probs = [] + for ent_count, ent_string in zip(counts, entities): + if ent_string in wp_titles: + wd_id = title_to_id[ent_string] + p_entity_givenalias = ent_count / total_count + selected_entities.append(wd_id) + prior_probs.append(p_entity_givenalias) + + if selected_entities: + try: + kb.add_alias(alias=previous_alias, entities=selected_entities, probabilities=prior_probs) + except ValueError as e: + print(e) + total_count = 0 + counts = [] + entities = [] + + total_count += count + + if len(entities) < max_entities_per_alias and count >= min_occ: + counts.append(count) + entities.append(entity) + previous_alias = new_alias + + line = prior_file.readline() + diff --git a/bin/wiki_entity_linking/train_descriptions.py b/bin/wiki_entity_linking/train_descriptions.py new file mode 100644 index 000000000..6a4d046e5 --- /dev/null +++ b/bin/wiki_entity_linking/train_descriptions.py @@ -0,0 +1,121 @@ +# coding: utf-8 +from random import shuffle + +import numpy as np + +from spacy._ml import zero_init, create_default_optimizer +from spacy.cli.pretrain import get_cossim_loss + +from thinc.v2v import Model +from thinc.api import chain +from thinc.neural._classes.affine import Affine + + +class EntityEncoder: + """ + Train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D). + This entity vector will be stored in the KB, for further downstream use in the entity model. 
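+    Internally this is a small encoder-decoder (two Affine layers) trained to reconstruct the averaged
+    word vectors of a description under a cosine-similarity loss; only the encoder half is kept to
+    produce the final entity vectors.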
+ """ + + DROP = 0 + EPOCHS = 5 + STOP_THRESHOLD = 0.04 + + BATCH_SIZE = 1000 + + def __init__(self, nlp, input_dim, desc_width): + self.nlp = nlp + self.input_dim = input_dim + self.desc_width = desc_width + + def apply_encoder(self, description_list): + if self.encoder is None: + raise ValueError("Can not apply encoder before training it") + + batch_size = 100000 + + start = 0 + stop = min(batch_size, len(description_list)) + encodings = [] + + while start < len(description_list): + docs = list(self.nlp.pipe(description_list[start:stop])) + doc_embeddings = [self._get_doc_embedding(doc) for doc in docs] + enc = self.encoder(np.asarray(doc_embeddings)) + encodings.extend(enc.tolist()) + + start = start + batch_size + stop = min(stop + batch_size, len(description_list)) + + return encodings + + def train(self, description_list, to_print=False): + processed, loss = self._train_model(description_list) + if to_print: + print("Trained on", processed, "entities across", self.EPOCHS, "epochs") + print("Final loss:", loss) + + def _train_model(self, description_list): + # TODO: when loss gets too low, a 'mean of empty slice' warning is thrown by numpy + + self._build_network(self.input_dim, self.desc_width) + + processed = 0 + loss = 1 + descriptions = description_list.copy() # copy this list so that shuffling does not affect other functions + + for i in range(self.EPOCHS): + shuffle(descriptions) + + batch_nr = 0 + start = 0 + stop = min(self.BATCH_SIZE, len(descriptions)) + + while loss > self.STOP_THRESHOLD and start < len(descriptions): + batch = [] + for descr in descriptions[start:stop]: + doc = self.nlp(descr) + doc_vector = self._get_doc_embedding(doc) + batch.append(doc_vector) + + loss = self._update(batch) + print(i, batch_nr, loss) + processed += len(batch) + + batch_nr += 1 + start = start + self.BATCH_SIZE + stop = min(stop + self.BATCH_SIZE, len(descriptions)) + + return processed, loss + + @staticmethod + def _get_doc_embedding(doc): + indices = np.zeros((len(doc),), dtype="i") + for i, word in enumerate(doc): + if word.orth in doc.vocab.vectors.key2row: + indices[i] = doc.vocab.vectors.key2row[word.orth] + else: + indices[i] = 0 + word_vectors = doc.vocab.vectors.data[indices] + doc_vector = np.mean(word_vectors, axis=0) + return doc_vector + + def _build_network(self, orig_width, hidden_with): + with Model.define_operators({">>": chain}): + # very simple encoder-decoder model + self.encoder = ( + Affine(hidden_with, orig_width) + ) + self.model = self.encoder >> zero_init(Affine(orig_width, hidden_with, drop_factor=0.0)) + self.sgd = create_default_optimizer(self.model.ops) + + def _update(self, vectors): + predictions, bp_model = self.model.begin_update(np.asarray(vectors), drop=self.DROP) + loss, d_scores = self._get_loss(scores=predictions, golds=np.asarray(vectors)) + bp_model(d_scores, sgd=self.sgd) + return loss / len(vectors) + + @staticmethod + def _get_loss(golds, scores): + loss, gradients = get_cossim_loss(scores, golds) + return loss, gradients diff --git a/bin/wiki_entity_linking/training_set_creator.py b/bin/wiki_entity_linking/training_set_creator.py new file mode 100644 index 000000000..5d401bb3f --- /dev/null +++ b/bin/wiki_entity_linking/training_set_creator.py @@ -0,0 +1,353 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import os +import re +import bz2 +import datetime + +from spacy.gold import GoldParse +from bin.wiki_entity_linking import kb_creator + +""" +Process Wikipedia interlinks to generate a training dataset for the EL algorithm. 
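+Raw article texts are written out as one file per article, named by its Wikipedia article ID.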
+Gold-standard entities are stored in one file in standoff format (by character offset). +""" + +ENTITY_FILE = "gold_entities.csv" + + +def create_training(wikipedia_input, entity_def_input, training_output): + wp_to_id = kb_creator.get_entity_to_id(entity_def_input) + _process_wikipedia_texts(wikipedia_input, wp_to_id, training_output, limit=None) + + +def _process_wikipedia_texts(wikipedia_input, wp_to_id, training_output, limit=None): + """ + Read the XML wikipedia data to parse out training data: + raw text data + positive instances + """ + title_regex = re.compile(r'(?<=<title>).*(?=</title>)') + id_regex = re.compile(r'(?<=<id>)\d*(?=</id>)') + + read_ids = set() + entityfile_loc = training_output / ENTITY_FILE + with open(entityfile_loc, mode="w", encoding='utf8') as entityfile: + # write entity training header file + _write_training_entity(outputfile=entityfile, + article_id="article_id", + alias="alias", + entity="WD_id", + start="start", + end="end") + + with bz2.open(wikipedia_input, mode='rb') as file: + line = file.readline() + cnt = 0 + article_text = "" + article_title = None + article_id = None + reading_text = False + reading_revision = False + while line and (not limit or cnt < limit): + if cnt % 1000000 == 0: + print(datetime.datetime.now(), "processed", cnt, "lines of Wikipedia dump") + clean_line = line.strip().decode("utf-8") + + if clean_line == "<revision>": + reading_revision = True + elif clean_line == "</revision>": + reading_revision = False + + # Start reading new page + if clean_line == "<page>": + article_text = "" + article_title = None + article_id = None + + # finished reading this page + elif clean_line == "</page>": + if article_id: + try: + _process_wp_text(wp_to_id, entityfile, article_id, article_title, article_text.strip(), + training_output) + except Exception as e: + print("Error processing article", article_id, article_title, e) + else: + print("Done processing a page, but couldn't find an article_id ?", article_title) + article_text = "" + article_title = None + article_id = None + reading_text = False + reading_revision = False + + # start reading text within a page + if "<text" in clean_line: + reading_text = True + + if reading_text: + article_text += " " + clean_line + + # stop reading text within a page (we assume a new page doesn't start on the same line) + if "</text" in clean_line: + reading_text = False + + # read the ID of this article (outside the revision portion of the document) + if not reading_revision: + ids = id_regex.search(clean_line) + if ids: + article_id = ids[0] + if article_id in read_ids: + print("Found duplicate article ID", article_id, clean_line) # This should never happen ... 
+ read_ids.add(article_id) + + # read the title of this article (outside the revision portion of the document) + if not reading_revision: + titles = title_regex.search(clean_line) + if titles: + article_title = titles[0].strip() + + line = file.readline() + cnt += 1 + + +text_regex = re.compile(r'(?<=<text xml:space=\"preserve\">).*(?=</text)') + + +def _process_wp_text(wp_to_id, entityfile, article_id, article_title, article_text, training_output): + found_entities = False + + # ignore meta Wikipedia pages + if article_title.startswith("Wikipedia:"): + return + + # remove the text tags + text = text_regex.search(article_text).group(0) + + # stop processing if this is a redirect page + if text.startswith("#REDIRECT"): + return + + # get the raw text without markup etc, keeping only interwiki links + clean_text = _get_clean_wp_text(text) + + # read the text char by char to get the right offsets for the interwiki links + final_text = "" + open_read = 0 + reading_text = True + reading_entity = False + reading_mention = False + reading_special_case = False + entity_buffer = "" + mention_buffer = "" + for index, letter in enumerate(clean_text): + if letter == '[': + open_read += 1 + elif letter == ']': + open_read -= 1 + elif letter == '|': + if reading_text: + final_text += letter + # switch from reading entity to mention in the [[entity|mention]] pattern + elif reading_entity: + reading_text = False + reading_entity = False + reading_mention = True + else: + reading_special_case = True + else: + if reading_entity: + entity_buffer += letter + elif reading_mention: + mention_buffer += letter + elif reading_text: + final_text += letter + else: + raise ValueError("Not sure at point", clean_text[index-2:index+2]) + + if open_read > 2: + reading_special_case = True + + if open_read == 2 and reading_text: + reading_text = False + reading_entity = True + reading_mention = False + + # we just finished reading an entity + if open_read == 0 and not reading_text: + if '#' in entity_buffer or entity_buffer.startswith(':'): + reading_special_case = True + # Ignore cases with nested structures like File: handles etc + if not reading_special_case: + if not mention_buffer: + mention_buffer = entity_buffer + start = len(final_text) + end = start + len(mention_buffer) + qid = wp_to_id.get(entity_buffer, None) + if qid: + _write_training_entity(outputfile=entityfile, + article_id=article_id, + alias=mention_buffer, + entity=qid, + start=start, + end=end) + found_entities = True + final_text += mention_buffer + + entity_buffer = "" + mention_buffer = "" + + reading_text = True + reading_entity = False + reading_mention = False + reading_special_case = False + + if found_entities: + _write_training_article(article_id=article_id, clean_text=final_text, training_output=training_output) + + +info_regex = re.compile(r'{[^{]*?}') +htlm_regex = re.compile(r'<!--[^-]*-->') +category_regex = re.compile(r'\[\[Category:[^\[]*]]') +file_regex = re.compile(r'\[\[File:[^[\]]+]]') +ref_regex = re.compile(r'<ref.*?>') # non-greedy +ref_2_regex = re.compile(r'</ref.*?>') # non-greedy + + +def _get_clean_wp_text(article_text): + clean_text = article_text.strip() + + # remove bolding & italic markup + clean_text = clean_text.replace('\'\'\'', '') + clean_text = clean_text.replace('\'\'', '') + + # remove nested {{info}} statements by removing the inner/smallest ones first and iterating + try_again = True + previous_length = len(clean_text) + while try_again: + clean_text = info_regex.sub('', clean_text) # non-greedy match excluding 
a nested { + if len(clean_text) < previous_length: + try_again = True + else: + try_again = False + previous_length = len(clean_text) + + # remove HTML comments + clean_text = htlm_regex.sub('', clean_text) + + # remove Category and File statements + clean_text = category_regex.sub('', clean_text) + clean_text = file_regex.sub('', clean_text) + + # remove multiple = + while '==' in clean_text: + clean_text = clean_text.replace("==", "=") + + clean_text = clean_text.replace(". =", ".") + clean_text = clean_text.replace(" = ", ". ") + clean_text = clean_text.replace("= ", ".") + clean_text = clean_text.replace(" =", "") + + # remove refs (non-greedy match) + clean_text = ref_regex.sub('', clean_text) + clean_text = ref_2_regex.sub('', clean_text) + + # remove additional wikiformatting + clean_text = re.sub(r'<blockquote>', '', clean_text) + clean_text = re.sub(r'</blockquote>', '', clean_text) + + # change special characters back to normal ones + clean_text = clean_text.replace(r'<', '<') + clean_text = clean_text.replace(r'>', '>') + clean_text = clean_text.replace(r'"', '"') + clean_text = clean_text.replace(r'&nbsp;', ' ') + clean_text = clean_text.replace(r'&', '&') + + # remove multiple spaces + while ' ' in clean_text: + clean_text = clean_text.replace(' ', ' ') + + return clean_text.strip() + + +def _write_training_article(article_id, clean_text, training_output): + file_loc = training_output / str(article_id) + ".txt" + with open(file_loc, mode='w', encoding='utf8') as outputfile: + outputfile.write(clean_text) + + +def _write_training_entity(outputfile, article_id, alias, entity, start, end): + outputfile.write(article_id + "|" + alias + "|" + entity + "|" + str(start) + "|" + str(end) + "\n") + + +def is_dev(article_id): + return article_id.endswith("3") + + +def read_training(nlp, training_dir, dev, limit): + # This method provides training examples that correspond to the entity annotations found by the nlp object + entityfile_loc = training_dir / ENTITY_FILE + data = [] + + # assume the data is written sequentially, so we can reuse the article docs + current_article_id = None + current_doc = None + ents_by_offset = dict() + skip_articles = set() + total_entities = 0 + + with open(entityfile_loc, mode='r', encoding='utf8') as file: + for line in file: + if not limit or len(data) < limit: + fields = line.replace('\n', "").split(sep='|') + article_id = fields[0] + alias = fields[1] + wp_title = fields[2] + start = fields[3] + end = fields[4] + + if dev == is_dev(article_id) and article_id != "article_id" and article_id not in skip_articles: + if not current_doc or (current_article_id != article_id): + # parse the new article text + file_name = article_id + ".txt" + try: + with open(os.path.join(training_dir, file_name), mode="r", encoding='utf8') as f: + text = f.read() + if len(text) < 30000: # threshold for convenience / speed of processing + current_doc = nlp(text) + current_article_id = article_id + ents_by_offset = dict() + for ent in current_doc.ents: + sent_length = len(ent.sent) + # custom filtering to avoid too long or too short sentences + if 5 < sent_length < 100: + ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] = ent + else: + skip_articles.add(article_id) + current_doc = None + except Exception as e: + print("Problem parsing article", article_id, e) + skip_articles.add(article_id) + raise e + + # repeat checking this condition in case an exception was thrown + if current_doc and (current_article_id == article_id): + found_ent = ents_by_offset.get(start + 
"_" + end, None) + if found_ent: + if found_ent.text != alias: + skip_articles.add(article_id) + current_doc = None + else: + sent = found_ent.sent.as_doc() + # currently feeding the gold data one entity per sentence at a time + gold_start = int(start) - found_ent.sent.start_char + gold_end = int(end) - found_ent.sent.start_char + gold_entities = [(gold_start, gold_end, wp_title)] + gold = GoldParse(doc=sent, links=gold_entities) + data.append((sent, gold)) + total_entities += 1 + if len(data) % 2500 == 0: + print(" -read", total_entities, "entities") + + print(" -read", total_entities, "entities") + return data diff --git a/bin/wiki_entity_linking/wikidata_processor.py b/bin/wiki_entity_linking/wikidata_processor.py new file mode 100644 index 000000000..a32a0769a --- /dev/null +++ b/bin/wiki_entity_linking/wikidata_processor.py @@ -0,0 +1,119 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import bz2 +import json +import datetime + + +def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False): + # Read the JSON wiki data and parse out the entities. Takes about 7u30 to parse 55M lines. + # get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/ + + lang = 'en' + site_filter = 'enwiki' + + # properties filter (currently disabled to get ALL data) + prop_filter = dict() + # prop_filter = {'P31': {'Q5', 'Q15632617'}} # currently defined as OR: one property suffices to be selected + + title_to_id = dict() + id_to_descr = dict() + + # parse appropriate fields - depending on what we need in the KB + parse_properties = False + parse_sitelinks = True + parse_labels = False + parse_descriptions = True + parse_aliases = False + parse_claims = False + + with bz2.open(wikidata_file, mode='rb') as file: + line = file.readline() + cnt = 0 + while line and (not limit or cnt < limit): + if cnt % 500000 == 0: + print(datetime.datetime.now(), "processed", cnt, "lines of WikiData dump") + clean_line = line.strip() + if clean_line.endswith(b","): + clean_line = clean_line[:-1] + if len(clean_line) > 1: + obj = json.loads(clean_line) + entry_type = obj["type"] + + if entry_type == "item": + # filtering records on their properties (currently disabled to get ALL data) + # keep = False + keep = True + + claims = obj["claims"] + if parse_claims: + for prop, value_set in prop_filter.items(): + claim_property = claims.get(prop, None) + if claim_property: + for cp in claim_property: + cp_id = cp['mainsnak'].get('datavalue', {}).get('value', {}).get('id') + cp_rank = cp['rank'] + if cp_rank != "deprecated" and cp_id in value_set: + keep = True + + if keep: + unique_id = obj["id"] + + if to_print: + print("ID:", unique_id) + print("type:", entry_type) + + # parsing all properties that refer to other entities + if parse_properties: + for prop, claim_property in claims.items(): + cp_dicts = [cp['mainsnak']['datavalue'].get('value') for cp in claim_property + if cp['mainsnak'].get('datavalue')] + cp_values = [cp_dict.get('id') for cp_dict in cp_dicts if isinstance(cp_dict, dict) + if cp_dict.get('id') is not None] + if cp_values: + if to_print: + print("prop:", prop, cp_values) + + found_link = False + if parse_sitelinks: + site_value = obj["sitelinks"].get(site_filter, None) + if site_value: + site = site_value['title'] + if to_print: + print(site_filter, ":", site) + title_to_id[site] = unique_id + found_link = True + + if parse_labels: + labels = obj["labels"] + if labels: + lang_label = labels.get(lang, None) + if lang_label: + if to_print: + print("label (" + 
lang + "):", lang_label["value"]) + + if found_link and parse_descriptions: + descriptions = obj["descriptions"] + if descriptions: + lang_descr = descriptions.get(lang, None) + if lang_descr: + if to_print: + print("description (" + lang + "):", lang_descr["value"]) + id_to_descr[unique_id] = lang_descr["value"] + + if parse_aliases: + aliases = obj["aliases"] + if aliases: + lang_aliases = aliases.get(lang, None) + if lang_aliases: + for item in lang_aliases: + if to_print: + print("alias (" + lang + "):", item["value"]) + + if to_print: + print() + line = file.readline() + cnt += 1 + + return title_to_id, id_to_descr diff --git a/bin/wiki_entity_linking/wikipedia_processor.py b/bin/wiki_entity_linking/wikipedia_processor.py new file mode 100644 index 000000000..c02e472bc --- /dev/null +++ b/bin/wiki_entity_linking/wikipedia_processor.py @@ -0,0 +1,182 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import re +import bz2 +import csv +import datetime + +""" +Process a Wikipedia dump to calculate entity frequencies and prior probabilities in combination with certain mentions. +Write these results to file for downstream KB and training data generation. +""" + +map_alias_to_link = dict() + +# these will/should be matched ignoring case +wiki_namespaces = ["b", "betawikiversity", "Book", "c", "Category", "Commons", + "d", "dbdump", "download", "Draft", "Education", "Foundation", + "Gadget", "Gadget definition", "gerrit", "File", "Help", "Image", "Incubator", + "m", "mail", "mailarchive", "media", "MediaWiki", "MediaWiki talk", "Mediawikiwiki", + "MediaZilla", "Meta", "Metawikipedia", "Module", + "mw", "n", "nost", "oldwikisource", "outreach", "outreachwiki", "otrs", "OTRSwiki", + "Portal", "phab", "Phabricator", "Project", "q", "quality", "rev", + "s", "spcom", "Special", "species", "Strategy", "sulutil", "svn", + "Talk", "Template", "Template talk", "Testwiki", "ticket", "TimedText", "Toollabs", "tools", + "tswiki", "User", "User talk", "v", "voy", + "w", "Wikibooks", "Wikidata", "wikiHow", "Wikinvest", "wikilivres", "Wikimedia", "Wikinews", + "Wikipedia", "Wikipedia talk", "Wikiquote", "Wikisource", "Wikispecies", "Wikitech", + "Wikiversity", "Wikivoyage", "wikt", "wiktionary", "wmf", "wmania", "WP"] + +# find the links +link_regex = re.compile(r'\[\[[^\[\]]*\]\]') + +# match on interwiki links, e.g. `en:` or `:fr:` +ns_regex = r":?" + "[a-z][a-z]" + ":" + +# match on Namespace: optionally preceded by a : +for ns in wiki_namespaces: + ns_regex += "|" + ":?" + ns + ":" + +ns_regex = re.compile(ns_regex, re.IGNORECASE) + + +def read_wikipedia_prior_probs(wikipedia_input, prior_prob_output): + """ + Read the XML wikipedia data and parse out intra-wiki links to estimate prior probabilities. + The full file takes about 2h to parse 1100M lines. + It works relatively fast because it runs line by line, irrelevant of which article the intrawiki is from. 
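+    Results are written to prior_prob_output as pipe-delimited "alias|count|entity" lines,
+    sorted by alias and then by descending count.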
+ """ + with bz2.open(wikipedia_input, mode='rb') as file: + line = file.readline() + cnt = 0 + while line: + if cnt % 5000000 == 0: + print(datetime.datetime.now(), "processed", cnt, "lines of Wikipedia dump") + clean_line = line.strip().decode("utf-8") + + aliases, entities, normalizations = get_wp_links(clean_line) + for alias, entity, norm in zip(aliases, entities, normalizations): + _store_alias(alias, entity, normalize_alias=norm, normalize_entity=True) + _store_alias(alias, entity, normalize_alias=norm, normalize_entity=True) + + line = file.readline() + cnt += 1 + + # write all aliases and their entities and count occurrences to file + with open(prior_prob_output, mode='w', encoding='utf8') as outputfile: + outputfile.write("alias" + "|" + "count" + "|" + "entity" + "\n") + for alias, alias_dict in sorted(map_alias_to_link.items(), key=lambda x: x[0]): + for entity, count in sorted(alias_dict.items(), key=lambda x: x[1], reverse=True): + outputfile.write(alias + "|" + str(count) + "|" + entity + "\n") + + +def _store_alias(alias, entity, normalize_alias=False, normalize_entity=True): + alias = alias.strip() + entity = entity.strip() + + # remove everything after # as this is not part of the title but refers to a specific paragraph + if normalize_entity: + # wikipedia titles are always capitalized + entity = _capitalize_first(entity.split("#")[0]) + if normalize_alias: + alias = alias.split("#")[0] + + if alias and entity: + alias_dict = map_alias_to_link.get(alias, dict()) + entity_count = alias_dict.get(entity, 0) + alias_dict[entity] = entity_count + 1 + map_alias_to_link[alias] = alias_dict + + +def get_wp_links(text): + aliases = [] + entities = [] + normalizations = [] + + matches = link_regex.findall(text) + for match in matches: + match = match[2:][:-2].replace("_", " ").strip() + + if ns_regex.match(match): + pass # ignore namespaces at the beginning of the string + + # this is a simple [[link]], with the alias the same as the mention + elif "|" not in match: + aliases.append(match) + entities.append(match) + normalizations.append(True) + + # in wiki format, the link is written as [[entity|alias]] + else: + splits = match.split("|") + entity = splits[0].strip() + alias = splits[1].strip() + # specific wiki format [[alias (specification)|]] + if len(alias) == 0 and "(" in entity: + alias = entity.split("(")[0] + aliases.append(alias) + entities.append(entity) + normalizations.append(False) + else: + aliases.append(alias) + entities.append(entity) + normalizations.append(False) + + return aliases, entities, normalizations + + +def _capitalize_first(text): + if not text: + return None + result = text[0].capitalize() + if len(result) > 0: + result += text[1:] + return result + + +def write_entity_counts(prior_prob_input, count_output, to_print=False): + # Write entity counts for quick access later + entity_to_count = dict() + total_count = 0 + + with open(prior_prob_input, mode='r', encoding='utf8') as prior_file: + # skip header + prior_file.readline() + line = prior_file.readline() + + while line: + splits = line.replace('\n', "").split(sep='|') + # alias = splits[0] + count = int(splits[1]) + entity = splits[2] + + current_count = entity_to_count.get(entity, 0) + entity_to_count[entity] = current_count + count + + total_count += count + + line = prior_file.readline() + + with open(count_output, mode='w', encoding='utf8') as entity_file: + entity_file.write("entity" + "|" + "count" + "\n") + for entity, count in entity_to_count.items(): + entity_file.write(entity + "|" + 
str(count) + "\n") + + if to_print: + for entity, count in entity_to_count.items(): + print("Entity count:", entity, count) + print("Total count:", total_count) + + +def get_all_frequencies(count_input): + entity_to_count = dict() + with open(count_input, 'r', encoding='utf8') as csvfile: + csvreader = csv.reader(csvfile, delimiter='|') + # skip header + next(csvreader) + for row in csvreader: + entity_to_count[row[0]] = int(row[1]) + + return entity_to_count + diff --git a/examples/pipeline/dummy_entity_linking.py b/examples/pipeline/dummy_entity_linking.py index 88415d040..0e59db304 100644 --- a/examples/pipeline/dummy_entity_linking.py +++ b/examples/pipeline/dummy_entity_linking.py @@ -9,26 +9,26 @@ from spacy.kb import KnowledgeBase def create_kb(vocab): - kb = KnowledgeBase(vocab=vocab) + kb = KnowledgeBase(vocab=vocab, entity_vector_length=1) # adding entities entity_0 = "Q1004791_Douglas" print("adding entity", entity_0) - kb.add_entity(entity=entity_0, prob=0.5) + kb.add_entity(entity=entity_0, prob=0.5, entity_vector=[0]) entity_1 = "Q42_Douglas_Adams" print("adding entity", entity_1) - kb.add_entity(entity=entity_1, prob=0.5) + kb.add_entity(entity=entity_1, prob=0.5, entity_vector=[1]) entity_2 = "Q5301561_Douglas_Haig" print("adding entity", entity_2) - kb.add_entity(entity=entity_2, prob=0.5) + kb.add_entity(entity=entity_2, prob=0.5, entity_vector=[2]) # adding aliases print() alias_0 = "Douglas" print("adding alias", alias_0) - kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.1, 0.6, 0.2]) + kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.6, 0.1, 0.2]) alias_1 = "Douglas Adams" print("adding alias", alias_1) @@ -41,8 +41,12 @@ def create_kb(vocab): def add_el(kb, nlp): - el_pipe = nlp.create_pipe(name='entity_linker', config={"kb": kb}) + el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 64}) + el_pipe.set_kb(kb) nlp.add_pipe(el_pipe, last=True) + nlp.begin_training() + el_pipe.context_weight = 0 + el_pipe.prior_weight = 1 for alias in ["Douglas Adams", "Douglas"]: candidates = nlp.linker.kb.get_candidates(alias) @@ -66,6 +70,6 @@ def add_el(kb, nlp): if __name__ == "__main__": - nlp = spacy.load('en_core_web_sm') - my_kb = create_kb(nlp.vocab) - add_el(my_kb, nlp) + my_nlp = spacy.load('en_core_web_sm') + my_kb = create_kb(my_nlp.vocab) + add_el(my_kb, my_nlp) diff --git a/examples/pipeline/wikidata_entity_linking.py b/examples/pipeline/wikidata_entity_linking.py new file mode 100644 index 000000000..17c2976dd --- /dev/null +++ b/examples/pipeline/wikidata_entity_linking.py @@ -0,0 +1,442 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import random +import datetime +from pathlib import Path + +from bin.wiki_entity_linking import training_set_creator, kb_creator, wikipedia_processor as wp +from bin.wiki_entity_linking.kb_creator import DESC_WIDTH + +import spacy +from spacy.kb import KnowledgeBase +from spacy.util import minibatch, compounding + +""" +Demonstrate how to build a knowledge base from WikiData and run an Entity Linking algorithm. 
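+The script is organised as numbered steps (prior probabilities, entity counts, KB creation,
+training-data creation, training and evaluation of the entity linking pipe), each of which can be
+toggled on or off with the booleans at the top of run_pipeline().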
+""" + +ROOT_DIR = Path("C:/Users/Sofie/Documents/data/") +OUTPUT_DIR = ROOT_DIR / 'wikipedia' +TRAINING_DIR = OUTPUT_DIR / 'training_data_nel' + +PRIOR_PROB = OUTPUT_DIR / 'prior_prob.csv' +ENTITY_COUNTS = OUTPUT_DIR / 'entity_freq.csv' +ENTITY_DEFS = OUTPUT_DIR / 'entity_defs.csv' +ENTITY_DESCR = OUTPUT_DIR / 'entity_descriptions.csv' + +KB_FILE = OUTPUT_DIR / 'kb_1' / 'kb' +NLP_1_DIR = OUTPUT_DIR / 'nlp_1' +NLP_2_DIR = OUTPUT_DIR / 'nlp_2' + +# get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/ +WIKIDATA_JSON = ROOT_DIR / 'wikidata' / 'wikidata-20190304-all.json.bz2' + +# get enwiki-latest-pages-articles-multistream.xml.bz2 from https://dumps.wikimedia.org/enwiki/latest/ +ENWIKI_DUMP = ROOT_DIR / 'wikipedia' / 'enwiki-20190320-pages-articles-multistream.xml.bz2' + +# KB construction parameters +MAX_CANDIDATES = 10 +MIN_ENTITY_FREQ = 20 +MIN_PAIR_OCC = 5 + +# model training parameters +EPOCHS = 10 +DROPOUT = 0.5 +LEARN_RATE = 0.005 +L2 = 1e-6 +CONTEXT_WIDTH = 128 + + +def run_pipeline(): + # set the appropriate booleans to define which parts of the pipeline should be re(run) + print("START", datetime.datetime.now()) + print() + nlp_1 = spacy.load('en_core_web_lg') + nlp_2 = None + kb_2 = None + + # one-time methods to create KB and write to file + to_create_prior_probs = False + to_create_entity_counts = False + to_create_kb = False + + # read KB back in from file + to_read_kb = True + to_test_kb = False + + # create training dataset + create_wp_training = False + + # train the EL pipe + train_pipe = True + measure_performance = True + + # test the EL pipe on a simple example + to_test_pipeline = True + + # write the NLP object, read back in and test again + to_write_nlp = True + to_read_nlp = True + test_from_file = False + + # STEP 1 : create prior probabilities from WP (run only once) + if to_create_prior_probs: + print("STEP 1: to_create_prior_probs", datetime.datetime.now()) + wp.read_wikipedia_prior_probs(wikipedia_input=ENWIKI_DUMP, prior_prob_output=PRIOR_PROB) + print() + + # STEP 2 : deduce entity frequencies from WP (run only once) + if to_create_entity_counts: + print("STEP 2: to_create_entity_counts", datetime.datetime.now()) + wp.write_entity_counts(prior_prob_input=PRIOR_PROB, count_output=ENTITY_COUNTS, to_print=False) + print() + + # STEP 3 : create KB and write to file (run only once) + if to_create_kb: + print("STEP 3a: to_create_kb", datetime.datetime.now()) + kb_1 = kb_creator.create_kb(nlp_1, + max_entities_per_alias=MAX_CANDIDATES, + min_entity_freq=MIN_ENTITY_FREQ, + min_occ=MIN_PAIR_OCC, + entity_def_output=ENTITY_DEFS, + entity_descr_output=ENTITY_DESCR, + count_input=ENTITY_COUNTS, + prior_prob_input=PRIOR_PROB, + wikidata_input=WIKIDATA_JSON) + print("kb entities:", kb_1.get_size_entities()) + print("kb aliases:", kb_1.get_size_aliases()) + print() + + print("STEP 3b: write KB and NLP", datetime.datetime.now()) + kb_1.dump(KB_FILE) + nlp_1.to_disk(NLP_1_DIR) + print() + + # STEP 4 : read KB back in from file + if to_read_kb: + print("STEP 4: to_read_kb", datetime.datetime.now()) + nlp_2 = spacy.load(NLP_1_DIR) + kb_2 = KnowledgeBase(vocab=nlp_2.vocab, entity_vector_length=DESC_WIDTH) + kb_2.load_bulk(KB_FILE) + print("kb entities:", kb_2.get_size_entities()) + print("kb aliases:", kb_2.get_size_aliases()) + print() + + # test KB + if to_test_kb: + check_kb(kb_2) + print() + + # STEP 5: create a training dataset from WP + if create_wp_training: + print("STEP 5: create training dataset", datetime.datetime.now()) + 
training_set_creator.create_training(wikipedia_input=ENWIKI_DUMP, + entity_def_input=ENTITY_DEFS, + training_output=TRAINING_DIR) + + # STEP 6: create and train the entity linking pipe + if train_pipe: + print("STEP 6: training Entity Linking pipe", datetime.datetime.now()) + type_to_int = {label: i for i, label in enumerate(nlp_2.entity.labels)} + print(" -analysing", len(type_to_int), "different entity types") + el_pipe = nlp_2.create_pipe(name='entity_linker', + config={"context_width": CONTEXT_WIDTH, + "pretrained_vectors": nlp_2.vocab.vectors.name, + "type_to_int": type_to_int}) + el_pipe.set_kb(kb_2) + nlp_2.add_pipe(el_pipe, last=True) + + other_pipes = [pipe for pipe in nlp_2.pipe_names if pipe != "entity_linker"] + with nlp_2.disable_pipes(*other_pipes): # only train Entity Linking + optimizer = nlp_2.begin_training() + optimizer.learn_rate = LEARN_RATE + optimizer.L2 = L2 + + # define the size (nr of entities) of training and dev set + train_limit = 5000 + dev_limit = 5000 + + train_data = training_set_creator.read_training(nlp=nlp_2, + training_dir=TRAINING_DIR, + dev=False, + limit=train_limit) + + print("Training on", len(train_data), "articles") + print() + + dev_data = training_set_creator.read_training(nlp=nlp_2, + training_dir=TRAINING_DIR, + dev=True, + limit=dev_limit) + + print("Dev testing on", len(dev_data), "articles") + print() + + if not train_data: + print("Did not find any training data") + else: + for itn in range(EPOCHS): + random.shuffle(train_data) + losses = {} + batches = minibatch(train_data, size=compounding(4.0, 128.0, 1.001)) + batchnr = 0 + + with nlp_2.disable_pipes(*other_pipes): + for batch in batches: + try: + docs, golds = zip(*batch) + nlp_2.update( + docs, + golds, + sgd=optimizer, + drop=DROPOUT, + losses=losses, + ) + batchnr += 1 + except Exception as e: + print("Error updating batch:", e) + + if batchnr > 0: + el_pipe.cfg["context_weight"] = 1 + el_pipe.cfg["prior_weight"] = 1 + dev_acc_context, dev_acc_context_dict = _measure_accuracy(dev_data, el_pipe) + losses['entity_linker'] = losses['entity_linker'] / batchnr + print("Epoch, train loss", itn, round(losses['entity_linker'], 2), + " / dev acc avg", round(dev_acc_context, 3)) + + # STEP 7: measure the performance of our trained pipe on an independent dev set + if len(dev_data) and measure_performance: + print() + print("STEP 7: performance measurement of Entity Linking pipe", datetime.datetime.now()) + print() + + counts, acc_r, acc_r_label, acc_p, acc_p_label, acc_o, acc_o_label = _measure_baselines(dev_data, kb_2) + print("dev counts:", sorted(counts.items(), key=lambda x: x[0])) + print("dev acc oracle:", round(acc_o, 3), [(x, round(y, 3)) for x, y in acc_o_label.items()]) + print("dev acc random:", round(acc_r, 3), [(x, round(y, 3)) for x, y in acc_r_label.items()]) + print("dev acc prior:", round(acc_p, 3), [(x, round(y, 3)) for x, y in acc_p_label.items()]) + + # using only context + el_pipe.cfg["context_weight"] = 1 + el_pipe.cfg["prior_weight"] = 0 + dev_acc_context, dev_acc_context_dict = _measure_accuracy(dev_data, el_pipe) + print("dev acc context avg:", round(dev_acc_context, 3), + [(x, round(y, 3)) for x, y in dev_acc_context_dict.items()]) + + # measuring combined accuracy (prior + context) + el_pipe.cfg["context_weight"] = 1 + el_pipe.cfg["prior_weight"] = 1 + dev_acc_combo, dev_acc_combo_dict = _measure_accuracy(dev_data, el_pipe, error_analysis=False) + print("dev acc combo avg:", round(dev_acc_combo, 3), + [(x, round(y, 3)) for x, y in dev_acc_combo_dict.items()]) + + # 
STEP 8: apply the EL pipe on a toy example + if to_test_pipeline: + print() + print("STEP 8: applying Entity Linking to toy example", datetime.datetime.now()) + print() + run_el_toy_example(nlp=nlp_2) + + # STEP 9: write the NLP pipeline (including entity linker) to file + if to_write_nlp: + print() + print("STEP 9: testing NLP IO", datetime.datetime.now()) + print() + print("writing to", NLP_2_DIR) + nlp_2.to_disk(NLP_2_DIR) + print() + + # verify that the IO has gone correctly + if to_read_nlp: + print("reading from", NLP_2_DIR) + nlp_3 = spacy.load(NLP_2_DIR) + + print("running toy example with NLP 3") + run_el_toy_example(nlp=nlp_3) + + # testing performance with an NLP model from file + if test_from_file: + nlp_2 = spacy.load(NLP_1_DIR) + nlp_3 = spacy.load(NLP_2_DIR) + el_pipe = nlp_3.get_pipe("entity_linker") + + dev_limit = 5000 + dev_data = training_set_creator.read_training(nlp=nlp_2, + training_dir=TRAINING_DIR, + dev=True, + limit=dev_limit) + + print("Dev testing from file on", len(dev_data), "articles") + print() + + dev_acc_combo, dev_acc_combo_dict = _measure_accuracy(dev_data, el_pipe=el_pipe, error_analysis=False) + print("dev acc combo avg:", round(dev_acc_combo, 3), + [(x, round(y, 3)) for x, y in dev_acc_combo_dict.items()]) + + print() + print("STOP", datetime.datetime.now()) + + +def _measure_accuracy(data, el_pipe=None, error_analysis=False): + # If the docs in the data require further processing with an entity linker, set el_pipe + correct_by_label = dict() + incorrect_by_label = dict() + + docs = [d for d, g in data if len(d) > 0] + if el_pipe is not None: + docs = list(el_pipe.pipe(docs)) + golds = [g for d, g in data if len(d) > 0] + + for doc, gold in zip(docs, golds): + try: + correct_entries_per_article = dict() + for entity in gold.links: + start, end, gold_kb = entity + correct_entries_per_article[str(start) + "-" + str(end)] = gold_kb + + for ent in doc.ents: + ent_label = ent.label_ + pred_entity = ent.kb_id_ + start = ent.start_char + end = ent.end_char + gold_entity = correct_entries_per_article.get(str(start) + "-" + str(end), None) + # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong' + if gold_entity is not None: + if gold_entity == pred_entity: + correct = correct_by_label.get(ent_label, 0) + correct_by_label[ent_label] = correct + 1 + else: + incorrect = incorrect_by_label.get(ent_label, 0) + incorrect_by_label[ent_label] = incorrect + 1 + if error_analysis: + print(ent.text, "in", doc) + print("Predicted", pred_entity, "should have been", gold_entity) + print() + + except Exception as e: + print("Error assessing accuracy", e) + + acc, acc_by_label = calculate_acc(correct_by_label, incorrect_by_label) + return acc, acc_by_label + + +def _measure_baselines(data, kb): + # Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound + counts_by_label = dict() + + random_correct_by_label = dict() + random_incorrect_by_label = dict() + + oracle_correct_by_label = dict() + oracle_incorrect_by_label = dict() + + prior_correct_by_label = dict() + prior_incorrect_by_label = dict() + + docs = [d for d, g in data if len(d) > 0] + golds = [g for d, g in data if len(d) > 0] + + for doc, gold in zip(docs, golds): + try: + correct_entries_per_article = dict() + for entity in gold.links: + start, end, gold_kb = entity + correct_entries_per_article[str(start) + "-" + str(end)] = gold_kb + + for ent in doc.ents: + ent_label = ent.label_ + start = ent.start_char + end = 
ent.end_char + gold_entity = correct_entries_per_article.get(str(start) + "-" + str(end), None) + + # the gold annotations are not complete so we can't evaluate missing annotations as 'wrong' + if gold_entity is not None: + counts_by_label[ent_label] = counts_by_label.get(ent_label, 0) + 1 + candidates = kb.get_candidates(ent.text) + oracle_candidate = "" + best_candidate = "" + random_candidate = "" + if candidates: + scores = [] + + for c in candidates: + scores.append(c.prior_prob) + if c.entity_ == gold_entity: + oracle_candidate = c.entity_ + + best_index = scores.index(max(scores)) + best_candidate = candidates[best_index].entity_ + random_candidate = random.choice(candidates).entity_ + + if gold_entity == best_candidate: + prior_correct_by_label[ent_label] = prior_correct_by_label.get(ent_label, 0) + 1 + else: + prior_incorrect_by_label[ent_label] = prior_incorrect_by_label.get(ent_label, 0) + 1 + + if gold_entity == random_candidate: + random_correct_by_label[ent_label] = random_correct_by_label.get(ent_label, 0) + 1 + else: + random_incorrect_by_label[ent_label] = random_incorrect_by_label.get(ent_label, 0) + 1 + + if gold_entity == oracle_candidate: + oracle_correct_by_label[ent_label] = oracle_correct_by_label.get(ent_label, 0) + 1 + else: + oracle_incorrect_by_label[ent_label] = oracle_incorrect_by_label.get(ent_label, 0) + 1 + + except Exception as e: + print("Error assessing accuracy", e) + + acc_prior, acc_prior_by_label = calculate_acc(prior_correct_by_label, prior_incorrect_by_label) + acc_rand, acc_rand_by_label = calculate_acc(random_correct_by_label, random_incorrect_by_label) + acc_oracle, acc_oracle_by_label = calculate_acc(oracle_correct_by_label, oracle_incorrect_by_label) + + return counts_by_label, acc_rand, acc_rand_by_label, acc_prior, acc_prior_by_label, acc_oracle, acc_oracle_by_label + + +def calculate_acc(correct_by_label, incorrect_by_label): + acc_by_label = dict() + total_correct = 0 + total_incorrect = 0 + all_keys = set() + all_keys.update(correct_by_label.keys()) + all_keys.update(incorrect_by_label.keys()) + for label in sorted(all_keys): + correct = correct_by_label.get(label, 0) + incorrect = incorrect_by_label.get(label, 0) + total_correct += correct + total_incorrect += incorrect + if correct == incorrect == 0: + acc_by_label[label] = 0 + else: + acc_by_label[label] = correct / (correct + incorrect) + acc = 0 + if not (total_correct == total_incorrect == 0): + acc = total_correct / (total_correct + total_incorrect) + return acc, acc_by_label + + +def check_kb(kb): + for mention in ("Bush", "Douglas Adams", "Homer", "Brazil", "China"): + candidates = kb.get_candidates(mention) + + print("generating candidates for " + mention + " :") + for c in candidates: + print(" ", c.prior_prob, c.alias_, "-->", c.entity_ + " (freq=" + str(c.entity_freq) + ")") + print() + + +def run_el_toy_example(nlp): + text = "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, " \ + "Douglas reminds us to always bring our towel, even in China or Brazil. " \ + "The main character in Doug's novel is the man Arthur Dent, " \ + "but Douglas doesn't write about George Washington or Homer Simpson." 
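+    # the text mixes a full name ("Douglas Adams") with ambiguous mentions ("Douglas")
+    # to illustrate the disambiguation the entity linker is expected to perform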
+ doc = nlp(text) + print(text) + for ent in doc.ents: + print(" ent", ent.text, ent.label_, ent.kb_id_) + print() + + +if __name__ == "__main__": + run_pipeline() diff --git a/pyproject.toml b/pyproject.toml index 80bb5905a..35f3d9215 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -5,6 +5,6 @@ requires = ["setuptools", "cymem>=2.0.2,<2.1.0", "preshed>=2.0.1,<2.1.0", "murmurhash>=0.28.0,<1.1.0", - "thinc==7.0.0.dev6", + "thinc>=7.0.8,<7.1.0", ] build-backend = "setuptools.build_meta" diff --git a/requirements.txt b/requirements.txt index 8cc52dfe4..5a6870cd3 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,7 +1,7 @@ # Our libraries cymem>=2.0.2,<2.1.0 preshed>=2.0.1,<2.1.0 -thinc>=7.0.2,<7.1.0 +thinc>=7.0.8,<7.1.0 blis>=0.2.2,<0.3.0 murmurhash>=0.28.0,<1.1.0 wasabi>=0.2.0,<1.1.0 diff --git a/setup.py b/setup.py index 33623588c..b36c48316 100755 --- a/setup.py +++ b/setup.py @@ -228,7 +228,7 @@ def setup_package(): "murmurhash>=0.28.0,<1.1.0", "cymem>=2.0.2,<2.1.0", "preshed>=2.0.1,<2.1.0", - "thinc>=7.0.2,<7.1.0", + "thinc>=7.0.8,<7.1.0", "blis>=0.2.2,<0.3.0", "plac<1.0.0,>=0.9.6", "requests>=2.13.0,<3.0.0", @@ -246,6 +246,7 @@ def setup_package(): "cuda100": ["thinc_gpu_ops>=0.0.1,<0.1.0", "cupy-cuda100>=5.0.0b4"], # Language tokenizers with external dependencies "ja": ["mecab-python3==0.7"], + "ko": ["natto-py==0.9.0"], }, python_requires=">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*", classifiers=[ diff --git a/spacy/_ml.py b/spacy/_ml.py index 349b88df9..4d9bb4c2b 100644 --- a/spacy/_ml.py +++ b/spacy/_ml.py @@ -24,7 +24,7 @@ from thinc.neural._classes.affine import _set_dimensions_if_needed import thinc.extra.load_nlp from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE -from .errors import Errors +from .errors import Errors, user_warning, Warnings from . import util try: @@ -299,7 +299,17 @@ def link_vectors_to_models(vocab): data = ops.asarray(vectors.data) # Set an entry here, so that vectors are accessed by StaticVectors # (unideal, I know) - thinc.extra.load_nlp.VECTORS[(ops.device, vectors.name)] = data + key = (ops.device, vectors.name) + if key in thinc.extra.load_nlp.VECTORS: + if thinc.extra.load_nlp.VECTORS[key].shape != data.shape: + # This is a hack to avoid the problem in #3853. Maybe we should + # print a warning as well? 
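            # If a different vectors table was already registered under this
            # (device, name) key, rename the new table by appending its row count
            # to the name and emit warning W019, so that both tables can coexist
            # in thinc's global VECTORS store instead of silently overwriting
            # each other.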
+ old_name = vectors.name + new_name = vectors.name + "_%d" % data.shape[0] + user_warning(Warnings.W019.format(old=old_name, new=new_name)) + vectors.name = new_name + key = (ops.device, vectors.name) + thinc.extra.load_nlp.VECTORS[key] = data def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): @@ -652,6 +662,51 @@ def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False, return model +def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg): + # TODO proper error + if "entity_width" not in cfg: + raise ValueError("entity_width not found") + if "context_width" not in cfg: + raise ValueError("context_width not found") + + conv_depth = cfg.get("conv_depth", 2) + cnn_maxout_pieces = cfg.get("cnn_maxout_pieces", 3) + pretrained_vectors = cfg.get("pretrained_vectors") # self.nlp.vocab.vectors.name + context_width = cfg.get("context_width") + entity_width = cfg.get("entity_width") + + with Model.define_operators({">>": chain, "**": clone}): + model = ( + Affine(entity_width, entity_width + context_width + 1 + ner_types) + >> Affine(1, entity_width, drop_factor=0.0) + >> logistic + ) + + # context encoder + tok2vec = ( + Tok2Vec( + width=hidden_width, + embed_size=embed_width, + pretrained_vectors=pretrained_vectors, + cnn_maxout_pieces=cnn_maxout_pieces, + subword_features=True, + conv_depth=conv_depth, + bilstm_depth=0, + ) + >> flatten_add_lengths + >> Pooling(mean_pool) + >> Residual(zero_init(Maxout(hidden_width, hidden_width))) + >> zero_init(Affine(context_width, hidden_width)) + ) + + model.tok2vec = tok2vec + + model.tok2vec = tok2vec + model.tok2vec.nO = context_width + model.nO = 1 + return model + + @layerize def flatten(seqs, drop=0.0): ops = Model.ops diff --git a/spacy/about.py b/spacy/about.py index 5e7093606..16e5e9522 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -4,13 +4,13 @@ # fmt: off __title__ = "spacy" -__version__ = "2.1.4" +__version__ = "2.1.6" __summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython" __uri__ = "https://spacy.io" __author__ = "Explosion AI" __email__ = "contact@explosion.ai" __license__ = "MIT" -__release__ = False +__release__ = True __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" diff --git a/spacy/attrs.pxd b/spacy/attrs.pxd index 79a177ba9..d9aca078c 100644 --- a/spacy/attrs.pxd +++ b/spacy/attrs.pxd @@ -1,4 +1,6 @@ # Reserve 64 values for flag features +from . cimport symbols + cdef enum attr_id_t: NULL_ATTR IS_ALPHA @@ -88,3 +90,4 @@ cdef enum attr_id_t: PROB LANG + ENT_KB_ID = symbols.ENT_KB_ID diff --git a/spacy/attrs.pyx b/spacy/attrs.pyx index ed1f39a3f..8eeea363f 100644 --- a/spacy/attrs.pyx +++ b/spacy/attrs.pyx @@ -84,6 +84,7 @@ IDS = { "DEP": DEP, "ENT_IOB": ENT_IOB, "ENT_TYPE": ENT_TYPE, + "ENT_KB_ID": ENT_KB_ID, "HEAD": HEAD, "SENT_START": SENT_START, "SPACY": SPACY, diff --git a/spacy/cli/pretrain.py b/spacy/cli/pretrain.py index 2fe5b247a..57c26fcbd 100644 --- a/spacy/cli/pretrain.py +++ b/spacy/cli/pretrain.py @@ -5,6 +5,7 @@ import plac import random import numpy import time +import re from collections import Counter from pathlib import Path from thinc.v2v import Affine, Maxout @@ -65,6 +66,13 @@ from .train import _load_pretrained_tok2vec "t2v", Path, ), + epoch_start=( + "The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been " + "renamed. 
Prevents unintended overwriting of existing weight files.", + "option", + "es", + int + ), ) def pretrain( texts_loc, @@ -83,6 +91,7 @@ def pretrain( seed=0, n_save_every=None, init_tok2vec=None, + epoch_start=None, ): """ Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components, @@ -151,9 +160,29 @@ def pretrain( if init_tok2vec is not None: components = _load_pretrained_tok2vec(nlp, init_tok2vec) msg.text("Loaded pretrained tok2vec for: {}".format(components)) + # Parse the epoch number from the given weight file + model_name = re.search(r"model\d+\.bin", str(init_tok2vec)) + if model_name: + # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin' + epoch_start = int(model_name.group(0)[5:][:-4]) + 1 + else: + if not epoch_start: + msg.fail( + "You have to use the '--epoch-start' argument when using a renamed weight file for " + "'--init-tok2vec'", exits=True + ) + elif epoch_start < 0: + msg.fail( + "The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid" % epoch_start, + exits=True + ) + else: + # Without '--init-tok2vec' the '--epoch-start' argument is ignored + epoch_start = 0 + optimizer = create_default_optimizer(model.ops) tracker = ProgressTracker(frequency=10000) - msg.divider("Pre-training tok2vec layer") + msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start) row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")} msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings) @@ -174,7 +203,7 @@ def pretrain( file_.write(srsly.json_dumps(log) + "\n") skip_counter = 0 - for epoch in range(n_iter): + for epoch in range(epoch_start, n_iter + epoch_start): for batch_id, batch in enumerate( util.minibatch_by_words(((text, None) for text in texts), size=batch_size) ): @@ -272,7 +301,7 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"): elif objective == "cosine": loss, d_target = get_cossim_loss(prediction, target) else: - raise ValueError(Errors.E139.format(loss_func=objective)) + raise ValueError(Errors.E142.format(loss_func=objective)) return loss, d_target diff --git a/spacy/errors.py b/spacy/errors.py index 176003e79..ed3d6afb9 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -82,6 +82,8 @@ class Warnings(object): "parallel inference via multiprocessing.") W017 = ("Alias '{alias}' already exists in the Knowledge base.") W018 = ("Entity '{entity}' already exists in the Knowledge base.") + W019 = ("Changing vectors name from {old} to {new}, to avoid clash with " + "previously loaded vectors. See Issue #3853.") @add_codes @@ -399,7 +401,11 @@ class Errors(object): E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input includes either the " "`text` or `tokens` key. For more info, see the docs:\n" "https://spacy.io/api/cli#pretrain-jsonl") - E139 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or 'cosine'") + E139 = ("Knowledge base for component '{name}' not initialized. Did you forget to call set_kb()?") + E140 = ("The list of entities, prior probabilities and entity vectors should be of equal length.") + E141 = ("Entity vectors should be of length {required} instead of the provided {found}.") + E142 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or 'cosine'") + E143 = ("Labels for component '{name}' not initialized. 
Did you forget to call add_label()?") @add_codes diff --git a/spacy/gold.pxd b/spacy/gold.pxd index a1550b1ef..8943a155a 100644 --- a/spacy/gold.pxd +++ b/spacy/gold.pxd @@ -31,6 +31,7 @@ cdef class GoldParse: cdef public list ents cdef public dict brackets cdef public object cats + cdef public list links cdef readonly list cand_to_gold cdef readonly list gold_to_cand diff --git a/spacy/gold.pyx b/spacy/gold.pyx index 569979a5f..4fb22f3f0 100644 --- a/spacy/gold.pyx +++ b/spacy/gold.pyx @@ -427,7 +427,7 @@ cdef class GoldParse: def __init__(self, doc, annot_tuples=None, words=None, tags=None, heads=None, deps=None, entities=None, make_projective=False, - cats=None, **_): + cats=None, links=None, **_): """Create a GoldParse. doc (Doc): The document the annotations refer to. @@ -450,6 +450,8 @@ cdef class GoldParse: examples of a label to have the value 0.0. Labels not in the dictionary are treated as missing - the gradient for those labels will be zero. + links (iterable): A sequence of `(start_char, end_char, kb_id)` tuples, + representing the external ID of an entity in a knowledge base. RETURNS (GoldParse): The newly constructed object. """ if words is None: @@ -485,6 +487,7 @@ cdef class GoldParse: self.c.ner = <Transition*>self.mem.alloc(len(doc), sizeof(Transition)) self.cats = {} if cats is None else dict(cats) + self.links = links self.words = [None] * len(doc) self.tags = [None] * len(doc) self.heads = [None] * len(doc) diff --git a/spacy/kb.pxd b/spacy/kb.pxd index e34a0a9ba..40b22b275 100644 --- a/spacy/kb.pxd +++ b/spacy/kb.pxd @@ -1,53 +1,27 @@ """Knowledge-base for entity or concept linking.""" from cymem.cymem cimport Pool from preshed.maps cimport PreshMap + from libcpp.vector cimport vector from libc.stdint cimport int32_t, int64_t +from libc.stdio cimport FILE from spacy.vocab cimport Vocab from .typedefs cimport hash_t - -# Internal struct, for storage and disambiguation. This isn't what we return -# to the user as the answer to "here's your entity". It's the minimum number -# of bits we need to keep track of the answers. -cdef struct _EntryC: - - # The hash of this entry's unique ID and name in the kB - hash_t entity_hash - - # Allows retrieval of one or more vectors. - # Each element of vector_rows should be an index into a vectors table. - # Every entry should have the same number of vectors, so we can avoid storing - # the number of vectors in each knowledge-base struct - int32_t* vector_rows - - # Allows retrieval of a struct of non-vector features. We could make this a - # pointer, but we have 32 bits left over in the struct after prob, so we'd - # like this to only be 32 bits. We can also set this to -1, for the common - # case where there are no features. - int32_t feats_row - - # log probability of entity, based on corpus frequency - float prob - - -# Each alias struct stores a list of Entry pointers with their prior probabilities -# for this specific mention/alias. -cdef struct _AliasC: - - # All entry candidates for this alias - vector[int64_t] entry_indices - - # Prior probability P(entity|alias) - should sum up to (at most) 1. - vector[float] probs +from .structs cimport KBEntryC, AliasC +ctypedef vector[KBEntryC] entry_vec +ctypedef vector[AliasC] alias_vec +ctypedef vector[float] float_vec +ctypedef vector[float_vec] float_matrix # Object used by the Entity Linker that summarizes one entity-alias candidate combination. 
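# A minimal usage sketch of the Python-level API declared in this header, kept in
# comment form here; the entity IDs, probabilities and 3-dimensional vectors are
# made-up illustrative values, not part of this changeset:
#
#     from spacy.kb import KnowledgeBase
#     from spacy.vocab import Vocab
#
#     kb = KnowledgeBase(vocab=Vocab(), entity_vector_length=3)
#     kb.add_entity(entity="Q42", prob=0.8, entity_vector=[1.0, 0.0, 0.0])
#     kb.add_entity(entity="Q5301561", prob=0.2, entity_vector=[0.0, 1.0, 0.0])
#     kb.add_alias(alias="Douglas Adams", entities=["Q42", "Q5301561"],
#                  probabilities=[0.7, 0.1])
#     candidates = kb.get_candidates("Douglas Adams")
#     kb.dump("/path/to/kb")   # binary format handled by the Writer/Reader below
#
# Each Candidate returned by get_candidates() bundles the entity hash, its corpus
# frequency, its entity vector, the alias hash and the prior probability
# P(entity | alias), as declared directly below.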
cdef class Candidate: - cdef readonly KnowledgeBase kb cdef hash_t entity_hash + cdef float entity_freq + cdef vector[float] entity_vector cdef hash_t alias_hash cdef float prior_prob @@ -55,9 +29,10 @@ cdef class Candidate: cdef class KnowledgeBase: cdef Pool mem cpdef readonly Vocab vocab + cdef int64_t entity_vector_length # This maps 64bit keys (hash of unique entity string) - # to 64bit values (position of the _EntryC struct in the _entries vector). + # to 64bit values (position of the _KBEntryC struct in the _entries vector). # The PreshMap is pretty space efficient, as it uses open addressing. So # the only overhead is the vacancy rate, which is approximately 30%. cdef PreshMap _entry_index @@ -66,7 +41,7 @@ cdef class KnowledgeBase: # over allocation. # In total we end up with (N*128*1.3)+(N*128*1.3) bits for N entries. # Storing 1m entries would take 41.6mb under this scheme. - cdef vector[_EntryC] _entries + cdef entry_vec _entries # This maps 64bit keys (hash of unique alias string) # to 64bit values (position of the _AliasC struct in the _aliases_table vector). @@ -76,7 +51,7 @@ cdef class KnowledgeBase: # should be P(entity | mention), which is pretty important to know. # We can pack both pieces of information into a 64-bit value, to keep things # efficient. - cdef vector[_AliasC] _aliases_table + cdef alias_vec _aliases_table # This is the part which might take more space: storing various # categorical features for the entries, and storing vectors for disambiguation @@ -87,7 +62,7 @@ cdef class KnowledgeBase: # model, that embeds different features of the entities into vectors. We'll # still want some per-entity features, like the Wikipedia text or entity # co-occurrence. Hopefully those vectors can be narrow, e.g. 64 dimensions. - cdef object _vectors_table + cdef float_matrix _vectors_table # It's very useful to track categorical features, at least for output, even # if they're not useful in the model itself. For instance, we should be @@ -96,53 +71,102 @@ cdef class KnowledgeBase: # optional data, we can let users configure a DB as the backend for this. cdef object _features_table + + cdef inline int64_t c_add_vector(self, vector[float] entity_vector) nogil: + """Add an entity vector to the vectors table.""" + cdef int64_t new_index = self._vectors_table.size() + self._vectors_table.push_back(entity_vector) + return new_index + + cdef inline int64_t c_add_entity(self, hash_t entity_hash, float prob, - int32_t* vector_rows, int feats_row): - """Add an entry to the knowledge base.""" - # This is what we'll map the hash key to. It's where the entry will sit + int32_t vector_index, int feats_row) nogil: + """Add an entry to the vector of entries. + After calling this method, make sure to update also the _entry_index using the return value""" + # This is what we'll map the entity hash key to. It's where the entry will sit # in the vector of entries, so we can get it later. 
cdef int64_t new_index = self._entries.size() - self._entries.push_back( - _EntryC( - entity_hash=entity_hash, - vector_rows=vector_rows, - feats_row=feats_row, - prob=prob - )) - self._entry_index[entity_hash] = new_index + + # Avoid struct initializer to enable nogil, cf https://github.com/cython/cython/issues/1642 + cdef KBEntryC entry + entry.entity_hash = entity_hash + entry.vector_index = vector_index + entry.feats_row = feats_row + entry.prob = prob + + self._entries.push_back(entry) return new_index - cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs): - """Connect a mention to a list of potential entities with their prior probabilities .""" + cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs) nogil: + """Connect a mention to a list of potential entities with their prior probabilities . + After calling this method, make sure to update also the _alias_index using the return value""" + # This is what we'll map the alias hash key to. It's where the alias will be defined + # in the vector of aliases. cdef int64_t new_index = self._aliases_table.size() - self._aliases_table.push_back( - _AliasC( - entry_indices=entry_indices, - probs=probs - )) - self._alias_index[alias_hash] = new_index + # Avoid struct initializer to enable nogil + cdef AliasC alias + alias.entry_indices = entry_indices + alias.probs = probs + + self._aliases_table.push_back(alias) return new_index - cdef inline _create_empty_vectors(self): + cdef inline void _create_empty_vectors(self, hash_t dummy_hash) nogil: """ - Making sure the first element of each vector is a dummy, + Initializing the vectors and making sure the first element of each vector is a dummy, because the PreshMap maps pointing to indices in these vectors can not contain 0 as value cf. 
https://github.com/explosion/preshed/issues/17 """ cdef int32_t dummy_value = 0 - self.vocab.strings.add("") - self._entries.push_back( - _EntryC( - entity_hash=self.vocab.strings[""], - vector_rows=&dummy_value, - feats_row=dummy_value, - prob=dummy_value - )) - self._aliases_table.push_back( - _AliasC( - entry_indices=[dummy_value], - probs=[dummy_value] - )) + + # Avoid struct initializer to enable nogil + cdef KBEntryC entry + entry.entity_hash = dummy_hash + entry.vector_index = dummy_value + entry.feats_row = dummy_value + entry.prob = dummy_value + + # Avoid struct initializer to enable nogil + cdef vector[int64_t] dummy_entry_indices + dummy_entry_indices.push_back(0) + cdef vector[float] dummy_probs + dummy_probs.push_back(0) + + cdef AliasC alias + alias.entry_indices = dummy_entry_indices + alias.probs = dummy_probs + + self._entries.push_back(entry) + self._aliases_table.push_back(alias) + + cpdef load_bulk(self, loc) + cpdef set_entities(self, entity_list, prob_list, vector_list) +cdef class Writer: + cdef FILE* _fp + + cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1 + cdef int write_vector_element(self, float element) except -1 + cdef int write_entry(self, hash_t entry_hash, float entry_prob, int32_t vector_index) except -1 + + cdef int write_alias_length(self, int64_t alias_length) except -1 + cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1 + cdef int write_alias(self, int64_t entry_index, float prob) except -1 + + cdef int _write(self, void* value, size_t size) except -1 + +cdef class Reader: + cdef FILE* _fp + + cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1 + cdef int read_vector_element(self, float* element) except -1 + cdef int read_entry(self, hash_t* entity_hash, float* prob, int32_t* vector_index) except -1 + + cdef int read_alias_length(self, int64_t* alias_length) except -1 + cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1 + cdef int read_alias(self, int64_t* entry_index, float* prob) except -1 + + cdef int _read(self, void* value, size_t size) except -1 + diff --git a/spacy/kb.pyx b/spacy/kb.pyx index 3a0a8b918..7c2daa659 100644 --- a/spacy/kb.pyx +++ b/spacy/kb.pyx @@ -1,13 +1,30 @@ +# cython: infer_types=True # cython: profile=True # coding: utf8 from spacy.errors import Errors, Warnings, user_warning +from pathlib import Path +from cymem.cymem cimport Pool +from preshed.maps cimport PreshMap + +from cpython.exc cimport PyErr_SetFromErrno + +from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek +from libc.stdint cimport int32_t, int64_t + +from .typedefs cimport hash_t + +from os import path +from libcpp.vector cimport vector + cdef class Candidate: - def __init__(self, KnowledgeBase kb, entity_hash, alias_hash, prior_prob): + def __init__(self, KnowledgeBase kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob): self.kb = kb self.entity_hash = entity_hash + self.entity_freq = entity_freq + self.entity_vector = entity_vector self.alias_hash = alias_hash self.prior_prob = prior_prob @@ -19,7 +36,7 @@ cdef class Candidate: @property def entity_(self): """RETURNS (unicode): ID/name of this entity in the KB""" - return self.kb.vocab.strings[self.entity] + return self.kb.vocab.strings[self.entity_hash] @property def alias(self): @@ -29,7 +46,15 @@ cdef class Candidate: @property def alias_(self): """RETURNS (unicode): ID of the original alias""" - return 
self.kb.vocab.strings[self.alias] + return self.kb.vocab.strings[self.alias_hash] + + @property + def entity_freq(self): + return self.entity_freq + + @property + def entity_vector(self): + return self.entity_vector @property def prior_prob(self): @@ -38,26 +63,41 @@ cdef class Candidate: cdef class KnowledgeBase: - def __init__(self, Vocab vocab): + def __init__(self, Vocab vocab, entity_vector_length): self.vocab = vocab + self.mem = Pool() + self.entity_vector_length = entity_vector_length + self._entry_index = PreshMap() self._alias_index = PreshMap() - self.mem = Pool() - self._create_empty_vectors() + + self.vocab.strings.add("") + self._create_empty_vectors(dummy_hash=self.vocab.strings[""]) + + @property + def entity_vector_length(self): + """RETURNS (uint64): length of the entity vectors""" + return self.entity_vector_length def __len__(self): return self.get_size_entities() def get_size_entities(self): - return self._entries.size() - 1 # not counting dummy element on index 0 + return len(self._entry_index) + + def get_entity_strings(self): + return [self.vocab.strings[x] for x in self._entry_index] def get_size_aliases(self): - return self._aliases_table.size() - 1 # not counting dummy element on index 0 + return len(self._alias_index) - def add_entity(self, unicode entity, float prob=0.5, vectors=None, features=None): + def get_alias_strings(self): + return [self.vocab.strings[x] for x in self._alias_index] + + def add_entity(self, unicode entity, float prob, vector[float] entity_vector): """ - Add an entity to the KB. - Return the hash of the entity ID at the end + Add an entity to the KB, optionally specifying its log probability based on corpus frequency + Return the hash of the entity ID/name at the end. """ cdef hash_t entity_hash = self.vocab.strings.add(entity) @@ -66,40 +106,72 @@ cdef class KnowledgeBase: user_warning(Warnings.W018.format(entity=entity)) return - cdef int32_t dummy_value = 342 - self.c_add_entity(entity_hash=entity_hash, prob=prob, - vector_rows=&dummy_value, feats_row=dummy_value) - # TODO self._vectors_table.get_pointer(vectors), - # self._features_table.get(features)) + # Raise an error if the provided entity vector is not of the correct length + if len(entity_vector) != self.entity_vector_length: + raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length)) + + vector_index = self.c_add_vector(entity_vector=entity_vector) + + new_index = self.c_add_entity(entity_hash=entity_hash, + prob=prob, + vector_index=vector_index, + feats_row=-1) # Features table currently not implemented + self._entry_index[entity_hash] = new_index return entity_hash + cpdef set_entities(self, entity_list, prob_list, vector_list): + if len(entity_list) != len(prob_list) or len(entity_list) != len(vector_list): + raise ValueError(Errors.E140) + + nr_entities = len(entity_list) + self._entry_index = PreshMap(nr_entities+1) + self._entries = entry_vec(nr_entities+1) + + i = 0 + cdef KBEntryC entry + while i < nr_entities: + entity_vector = vector_list[i] + if len(entity_vector) != self.entity_vector_length: + raise ValueError(Errors.E141.format(found=len(entity_vector), required=self.entity_vector_length)) + + entity_hash = self.vocab.strings.add(entity_list[i]) + entry.entity_hash = entity_hash + entry.prob = prob_list[i] + + vector_index = self.c_add_vector(entity_vector=vector_list[i]) + entry.vector_index = vector_index + + entry.feats_row = -1 # Features table currently not implemented + + self._entries[i+1] = entry + 
self._entry_index[entity_hash] = i+1 + + i += 1 + def add_alias(self, unicode alias, entities, probabilities): """ For a given alias, add its potential entities and prior probabilies to the KB. Return the alias_hash at the end """ - # Throw an error if the length of entities and probabilities are not the same if not len(entities) == len(probabilities): raise ValueError(Errors.E132.format(alias=alias, entities_length=len(entities), probabilities_length=len(probabilities))) - # Throw an error if the probabilities sum up to more than 1 + # Throw an error if the probabilities sum up to more than 1 (allow for some rounding errors) prob_sum = sum(probabilities) - if prob_sum > 1: + if prob_sum > 1.00001: raise ValueError(Errors.E133.format(alias=alias, sum=prob_sum)) cdef hash_t alias_hash = self.vocab.strings.add(alias) - # Return if this alias was added before + # Check whether this alias was added before if alias_hash in self._alias_index: user_warning(Warnings.W017.format(alias=alias)) return - cdef hash_t entity_hash - cdef vector[int64_t] entry_indices cdef vector[float] probs @@ -112,20 +184,295 @@ cdef class KnowledgeBase: entry_indices.push_back(int(entry_index)) probs.push_back(float(prob)) - self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs) + new_index = self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs) + self._alias_index[alias_hash] = new_index return alias_hash - def get_candidates(self, unicode alias): - """ TODO: where to put this functionality ?""" cdef hash_t alias_hash = self.vocab.strings[alias] alias_index = <int64_t>self._alias_index.get(alias_hash) alias_entry = self._aliases_table[alias_index] return [Candidate(kb=self, entity_hash=self._entries[entry_index].entity_hash, + entity_freq=self._entries[entry_index].prob, + entity_vector=self._vectors_table[self._entries[entry_index].vector_index], alias_hash=alias_hash, prior_prob=prob) for (entry_index, prob) in zip(alias_entry.entry_indices, alias_entry.probs) if entry_index != 0] + + + def dump(self, loc): + cdef Writer writer = Writer(loc) + writer.write_header(self.get_size_entities(), self.entity_vector_length) + + # dumping the entity vectors in their original order + i = 0 + for entity_vector in self._vectors_table: + for element in entity_vector: + writer.write_vector_element(element) + i = i+1 + + # dumping the entry records in the order in which they are in the _entries vector. + # index 0 is a dummy object not stored in the _entry_index and can be ignored. + i = 1 + for entry_hash, entry_index in sorted(self._entry_index.items(), key=lambda x: x[1]): + entry = self._entries[entry_index] + assert entry.entity_hash == entry_hash + assert entry_index == i + writer.write_entry(entry.entity_hash, entry.prob, entry.vector_index) + i = i+1 + + writer.write_alias_length(self.get_size_aliases()) + + # dumping the aliases in the order in which they are in the _alias_index vector. + # index 0 is a dummy object not stored in the _aliases_table and can be ignored. 
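        # The resulting binary layout, mirrored by load_bulk() and the Reader class:
        # a header holding the number of entries and the entity vector length, then
        # all entity vector elements in order, then one (entity_hash, prob,
        # vector_index) record per entry, then the number of aliases and, for each
        # alias, its hash, its candidate count and the (entry_index, prior_prob)
        # pairs written in the loop below.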
+ i = 1 + for alias_hash, alias_index in sorted(self._alias_index.items(), key=lambda x: x[1]): + alias = self._aliases_table[alias_index] + assert alias_index == i + + candidate_length = len(alias.entry_indices) + writer.write_alias_header(alias_hash, candidate_length) + + for j in range(0, candidate_length): + writer.write_alias(alias.entry_indices[j], alias.probs[j]) + + i = i+1 + + writer.close() + + cpdef load_bulk(self, loc): + cdef hash_t entity_hash + cdef hash_t alias_hash + cdef int64_t entry_index + cdef float prob + cdef int32_t vector_index + cdef KBEntryC entry + cdef AliasC alias + cdef float vector_element + + cdef Reader reader = Reader(loc) + + # STEP 0: load header and initialize KB + cdef int64_t nr_entities + cdef int64_t entity_vector_length + reader.read_header(&nr_entities, &entity_vector_length) + + self.entity_vector_length = entity_vector_length + self._entry_index = PreshMap(nr_entities+1) + self._entries = entry_vec(nr_entities+1) + self._vectors_table = float_matrix(nr_entities+1) + + # STEP 1: load entity vectors + cdef int i = 0 + cdef int j = 0 + while i < nr_entities: + entity_vector = float_vec(entity_vector_length) + j = 0 + while j < entity_vector_length: + reader.read_vector_element(&vector_element) + entity_vector[j] = vector_element + j = j+1 + self._vectors_table[i] = entity_vector + i = i+1 + + # STEP 2: load entities + # we assume that the entity data was written in sequence + # index 0 is a dummy object not stored in the _entry_index and can be ignored. + i = 1 + while i <= nr_entities: + reader.read_entry(&entity_hash, &prob, &vector_index) + + entry.entity_hash = entity_hash + entry.prob = prob + entry.vector_index = vector_index + entry.feats_row = -1 # Features table currently not implemented + + self._entries[i] = entry + self._entry_index[entity_hash] = i + + i += 1 + + # check that all entities were read in properly + assert nr_entities == self.get_size_entities() + + # STEP 3: load aliases + + cdef int64_t nr_aliases + reader.read_alias_length(&nr_aliases) + self._alias_index = PreshMap(nr_aliases+1) + self._aliases_table = alias_vec(nr_aliases+1) + + cdef int64_t nr_candidates + cdef vector[int64_t] entry_indices + cdef vector[float] probs + + i = 1 + # we assume the alias data was written in sequence + # index 0 is a dummy object not stored in the _entry_index and can be ignored. + while i <= nr_aliases: + reader.read_alias_header(&alias_hash, &nr_candidates) + entry_indices = vector[int64_t](nr_candidates) + probs = vector[float](nr_candidates) + + for j in range(0, nr_candidates): + reader.read_alias(&entry_index, &prob) + entry_indices[j] = entry_index + probs[j] = prob + + alias.entry_indices = entry_indices + alias.probs = probs + + self._aliases_table[i] = alias + self._alias_index[alias_hash] = i + + i += 1 + + # check that all aliases were read in properly + assert nr_aliases == self.get_size_aliases() + + +cdef class Writer: + def __init__(self, object loc): + if path.exists(loc): + assert not path.isdir(loc), "%s is directory." 
% loc + if isinstance(loc, Path): + loc = bytes(loc) + cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc + self._fp = fopen(<char*>bytes_loc, 'wb') + assert self._fp != NULL + fseek(self._fp, 0, 0) + + def close(self): + cdef size_t status = fclose(self._fp) + assert status == 0 + + cdef int write_header(self, int64_t nr_entries, int64_t entity_vector_length) except -1: + self._write(&nr_entries, sizeof(nr_entries)) + self._write(&entity_vector_length, sizeof(entity_vector_length)) + + cdef int write_vector_element(self, float element) except -1: + self._write(&element, sizeof(element)) + + cdef int write_entry(self, hash_t entry_hash, float entry_prob, int32_t vector_index) except -1: + self._write(&entry_hash, sizeof(entry_hash)) + self._write(&entry_prob, sizeof(entry_prob)) + self._write(&vector_index, sizeof(vector_index)) + # Features table currently not implemented and not written to file + + cdef int write_alias_length(self, int64_t alias_length) except -1: + self._write(&alias_length, sizeof(alias_length)) + + cdef int write_alias_header(self, hash_t alias_hash, int64_t candidate_length) except -1: + self._write(&alias_hash, sizeof(alias_hash)) + self._write(&candidate_length, sizeof(candidate_length)) + + cdef int write_alias(self, int64_t entry_index, float prob) except -1: + self._write(&entry_index, sizeof(entry_index)) + self._write(&prob, sizeof(prob)) + + cdef int _write(self, void* value, size_t size) except -1: + status = fwrite(value, size, 1, self._fp) + assert status == 1, status + + +cdef class Reader: + def __init__(self, object loc): + assert path.exists(loc) + assert not path.isdir(loc) + if isinstance(loc, Path): + loc = bytes(loc) + cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc + self._fp = fopen(<char*>bytes_loc, 'rb') + if not self._fp: + PyErr_SetFromErrno(IOError) + status = fseek(self._fp, 0, 0) # this can be 0 if there is no header + + def __dealloc__(self): + fclose(self._fp) + + cdef int read_header(self, int64_t* nr_entries, int64_t* entity_vector_length) except -1: + status = self._read(nr_entries, sizeof(int64_t)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading header from input file") + + status = self._read(entity_vector_length, sizeof(int64_t)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading header from input file") + + cdef int read_vector_element(self, float* element) except -1: + status = self._read(element, sizeof(float)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading entity vector from input file") + + cdef int read_entry(self, hash_t* entity_hash, float* prob, int32_t* vector_index) except -1: + status = self._read(entity_hash, sizeof(hash_t)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading entity hash from input file") + + status = self._read(prob, sizeof(float)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading entity prob from input file") + + status = self._read(vector_index, sizeof(int32_t)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading entity vector from input file") + + if feof(self._fp): + return 0 + else: + return 1 + + cdef int read_alias_length(self, int64_t* alias_length) except -1: + status = self._read(alias_length, sizeof(int64_t)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise 
IOError("error reading alias length from input file") + + cdef int read_alias_header(self, hash_t* alias_hash, int64_t* candidate_length) except -1: + status = self._read(alias_hash, sizeof(hash_t)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading alias hash from input file") + + status = self._read(candidate_length, sizeof(int64_t)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading candidate length from input file") + + cdef int read_alias(self, int64_t* entry_index, float* prob) except -1: + status = self._read(entry_index, sizeof(int64_t)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading entry index for alias from input file") + + status = self._read(prob, sizeof(float)) + if status < 1: + if feof(self._fp): + return 0 # end of file + raise IOError("error reading prob for entity/alias from input file") + + cdef int _read(self, void* value, size_t size) except -1: + status = fread(value, size, 1, self._fp) + return status + + diff --git a/spacy/lang/char_classes.py b/spacy/lang/char_classes.py index cb2e817d5..fb320b2ff 100644 --- a/spacy/lang/char_classes.py +++ b/spacy/lang/char_classes.py @@ -9,6 +9,8 @@ _bengali = r"\u0980-\u09FF" _hebrew = r"\u0591-\u05F4\uFB1D-\uFB4F" +_hindi = r"\u0900-\u097F" + # Latin standard _latin_u_standard = r"A-Z" _latin_l_standard = r"a-z" @@ -193,7 +195,7 @@ _ukrainian = r"а-щюяіїєґА-ЩЮЯІЇЄҐ" _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower -_uncased = _bengali + _hebrew + _persian + _sinhala +_uncased = _bengali + _hebrew + _persian + _sinhala + _hindi ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased) ALPHA_LOWER = group_chars(_lower + _uncased) diff --git a/spacy/lang/ko/__init__.py b/spacy/lang/ko/__init__.py new file mode 100644 index 000000000..f5dff75f1 --- /dev/null +++ b/spacy/lang/ko/__init__.py @@ -0,0 +1,120 @@ +# encoding: utf8 +from __future__ import unicode_literals, print_function + +import re +import sys + + +from .stop_words import STOP_WORDS +from .tag_map import TAG_MAP +from ...attrs import LANG +from ...language import Language +from ...tokens import Doc +from ...compat import copy_reg +from ...util import DummyTokenizer +from ...compat import is_python3, is_python_pre_3_5 + +is_python_post_3_7 = is_python3 and sys.version_info[1] >= 7 + +# fmt: off +if is_python_pre_3_5: + from collections import namedtuple + Morpheme = namedtuple("Morpheme", "surface lemma tag") +elif is_python_post_3_7: + from dataclasses import dataclass + + @dataclass(frozen=True) + class Morpheme: + surface: str + lemma: str + tag: str +else: + from typing import NamedTuple + + class Morpheme(NamedTuple): + surface: str + lemma: str + tag: str + + +def try_mecab_import(): + try: + from natto import MeCab + return MeCab + except ImportError: + raise ImportError( + "Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), " + "[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), " + "and [natto-py](https://github.com/buruzaemon/natto-py)" + ) +# fmt: on + + +def check_spaces(text, tokens): + token_pattern = re.compile(r"\s?".join(f"({t})" for t in tokens)) + m = token_pattern.match(text) + if m is not None: + for i in range(1, m.lastindex): + yield m.end(i) < m.start(i + 1) + yield False + + +class KoreanTokenizer(DummyTokenizer): + 
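    # Wraps the MeCab bindings from natto-py (backed by mecab-ko and mecab-ko-dic):
    # detailed_tokens() yields one Morpheme(surface, lemma, tag) per analysed node,
    # and check_spaces() recovers the whitespace flags by re-matching the surface
    # forms against the original text, so the resulting Doc preserves the input
    # spacing.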
def __init__(self, cls, nlp=None): + self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) + self.Tokenizer = try_mecab_import() + + def __call__(self, text): + dtokens = list(self.detailed_tokens(text)) + surfaces = [dt.surface for dt in dtokens] + doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces))) + for token, dtoken in zip(doc, dtokens): + first_tag, sep, eomi_tags = dtoken.tag.partition("+") + token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미) + token.lemma_ = dtoken.lemma + doc.user_data["full_tags"] = [dt.tag for dt in dtokens] + return doc + + def detailed_tokens(self, text): + # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3], + # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], * + with self.Tokenizer("-F%f[0],%f[7]") as tokenizer: + for node in tokenizer.parse(text, as_nodes=True): + if node.is_eos(): + break + surface = node.surface + feature = node.feature + tag, _, expr = feature.partition(",") + lemma, _, remainder = expr.partition("/") + if lemma == "*": + lemma = surface + yield Morpheme(surface, lemma, tag) + + +class KoreanDefaults(Language.Defaults): + lex_attr_getters = dict(Language.Defaults.lex_attr_getters) + lex_attr_getters[LANG] = lambda _text: "ko" + stop_words = STOP_WORDS + tag_map = TAG_MAP + writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} + + @classmethod + def create_tokenizer(cls, nlp=None): + return KoreanTokenizer(cls, nlp) + + +class Korean(Language): + lang = "ko" + Defaults = KoreanDefaults + + def make_doc(self, text): + return self.tokenizer(text) + + +def pickle_korean(instance): + return Korean, tuple() + + +copy_reg.pickle(Korean, pickle_korean) + +__all__ = ["Korean"] diff --git a/spacy/lang/ko/examples.py b/spacy/lang/ko/examples.py new file mode 100644 index 000000000..10a6ea9bd --- /dev/null +++ b/spacy/lang/ko/examples.py @@ -0,0 +1,15 @@ +# coding: utf8 +from __future__ import unicode_literals +""" +Example sentences to test spaCy and its language models. + +>>> from spacy.lang.ko.examples import sentences +>>> docs = nlp.pipe(sentences) +""" + +sentences = [ + "애플이 영국의 신생 기업을 10억 달러에 구매를 고려중이다.", + "자동 운전 자동차의 손해 배상 책임에 자동차 메이커에 일정한 부담을 요구하겠다.", + "자동 배달 로봇이 보도를 주행하는 것을 샌프란시스코시가 금지를 검토중이라고 합니다.", + "런던은 영국의 수도이자 가장 큰 도시입니다." 
+] diff --git a/spacy/lang/ko/stop_words.py b/spacy/lang/ko/stop_words.py new file mode 100644 index 000000000..53cf6f29a --- /dev/null +++ b/spacy/lang/ko/stop_words.py @@ -0,0 +1,68 @@ +# coding: utf8 +from __future__ import unicode_literals + +STOP_WORDS = set(""" +이 +있 +하 +것 +들 +그 +되 +수 +이 +보 +않 +없 +나 +주 +아니 +등 +같 +때 +년 +가 +한 +지 +오 +말 +일 +그렇 +위하 +때문 +그것 +두 +말하 +알 +그러나 +받 +못하 +일 +그런 +또 +더 +많 +그리고 +좋 +크 +시키 +그러 +하나 +살 +데 +안 +어떤 +번 +나 +다른 +어떻 +들 +이렇 +점 +싶 +말 +좀 +원 +잘 +놓 +""".split()) diff --git a/spacy/lang/ko/tag_map.py b/spacy/lang/ko/tag_map.py new file mode 100644 index 000000000..57317c969 --- /dev/null +++ b/spacy/lang/ko/tag_map.py @@ -0,0 +1,59 @@ +# encoding: utf8 +from __future__ import unicode_literals + +from ...symbols import POS, PUNCT, INTJ, X, SYM, ADJ, AUX, ADP, CONJ, NOUN, PRON +from ...symbols import VERB, ADV, PROPN, NUM, DET + +# 은전한닢(mecab-ko-dic)의 품사 태그를 universal pos tag로 대응시킴 +# https://docs.google.com/spreadsheets/d/1-9blXKjtjeKZqsf4NzHeYJCrr49-nXeRF6D80udfcwY/edit#gid=589544265 +# https://universaldependencies.org/u/pos/ +TAG_MAP = { + # J.{1,2} 조사 + "JKS": {POS: ADP}, + "JKC": {POS: ADP}, + "JKG": {POS: ADP}, + "JKO": {POS: ADP}, + "JKB": {POS: ADP}, + "JKV": {POS: ADP}, + "JKQ": {POS: ADP}, + "JX": {POS: ADP}, # 보조사 + "JC": {POS: CONJ}, # 접속 조사 + "MAJ": {POS: CONJ}, # 접속 부사 + "MAG": {POS: ADV}, # 일반 부사 + "MM": {POS: DET}, # 관형사 + "XPN": {POS: X}, # 접두사 + # XS. 접미사 + "XSN": {POS: X}, + "XSV": {POS: X}, + "XSA": {POS: X}, + "XR": {POS: X}, # 어근 + # E.{1,2} 어미 + "EP": {POS: X}, + "EF": {POS: X}, + "EC": {POS: X}, + "ETN": {POS: X}, + "ETM": {POS: X}, + "IC": {POS: INTJ}, # 감탄사 + "VV": {POS: VERB}, # 동사 + "VA": {POS: ADJ}, # 형용사 + "VX": {POS: AUX}, # 보조 용언 + "VCP": {POS: ADP}, # 긍정 지정사(이다) + "VCN": {POS: ADJ}, # 부정 지정사(아니다) + "NNG": {POS: NOUN}, # 일반 명사(general noun) + "NNB": {POS: NOUN}, # 의존 명사 + "NNBC": {POS: NOUN}, # 의존 명사(단위: unit) + "NNP": {POS: PROPN}, # 고유 명사(proper noun) + "NP": {POS: PRON}, # 대명사 + "NR": {POS: NUM}, # 수사(numerals) + "SN": {POS: NUM}, # 숫자 + # S.{1,2} 부호 + # 문장 부호 + "SF": {POS: PUNCT}, # period or other EOS marker + "SE": {POS: PUNCT}, + "SC": {POS: PUNCT}, # comma, etc. + "SSO": {POS: PUNCT}, # open bracket + "SSC": {POS: PUNCT}, # close bracket + "SY": {POS: SYM}, # 기타 기호 + "SL": {POS: X}, # 외국어 + "SH": {POS: X}, # 한자 +} diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index 2dd8c2940..86658ce99 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -262,13 +262,13 @@ cdef find_matches(TokenPatternC** patterns, int n, Doc doc, extensions=None, cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil: + # There have been a few bugs here. # The code was originally designed to always have pattern[1].attrs.value # be the ent_id when we get to the end of a pattern. However, Issue #2671 # showed this wasn't the case when we had a reject-and-continue before a - # match. I still don't really understand what's going on here, but this - # workaround does resolve the issue. - while pattern.attrs.attr != ID and \ - (pattern.nr_attr > 0 or pattern.nr_extra_attr > 0 or pattern.nr_py > 0): + # match. + # The patch to #2671 was wrong though, which came up in #3839. 
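    # The fix simply walks the pattern until the slot holding the ID attribute (the
    # pattern's ent_id) is reached, instead of also stopping on the
    # nr_attr/nr_extra_attr/nr_py counts, which terminated the walk too early for
    # the patterns reported in #3839.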
+ while pattern.attrs.attr != ID: pattern += 1 return pattern.attrs.value diff --git a/spacy/pipeline/entityruler.py b/spacy/pipeline/entityruler.py index 4f89e4186..35b465ceb 100644 --- a/spacy/pipeline/entityruler.py +++ b/spacy/pipeline/entityruler.py @@ -10,7 +10,7 @@ from ..util import ensure_path, to_disk, from_disk from ..tokens import Span from ..matcher import Matcher, PhraseMatcher -DEFAULT_ENT_ID_SEP = '||' +DEFAULT_ENT_ID_SEP = "||" class EntityRuler(object): @@ -53,7 +53,9 @@ class EntityRuler(object): self.matcher = Matcher(nlp.vocab) if phrase_matcher_attr is not None: self.phrase_matcher_attr = phrase_matcher_attr - self.phrase_matcher = PhraseMatcher(nlp.vocab, attr=self.phrase_matcher_attr) + self.phrase_matcher = PhraseMatcher( + nlp.vocab, attr=self.phrase_matcher_attr + ) else: self.phrase_matcher_attr = None self.phrase_matcher = PhraseMatcher(nlp.vocab) @@ -223,13 +225,14 @@ class EntityRuler(object): """ cfg = srsly.msgpack_loads(patterns_bytes) if isinstance(cfg, dict): - self.add_patterns(cfg.get('patterns', cfg)) - self.overwrite = cfg.get('overwrite', False) - self.phrase_matcher_attr = cfg.get('phrase_matcher_attr', None) + self.add_patterns(cfg.get("patterns", cfg)) + self.overwrite = cfg.get("overwrite", False) + self.phrase_matcher_attr = cfg.get("phrase_matcher_attr", None) if self.phrase_matcher_attr is not None: - self.phrase_matcher = PhraseMatcher(self.nlp.vocab, - attr=self.phrase_matcher_attr) - self.ent_id_sep = cfg.get('ent_id_sep', DEFAULT_ENT_ID_SEP) + self.phrase_matcher = PhraseMatcher( + self.nlp.vocab, attr=self.phrase_matcher_attr + ) + self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP) else: self.add_patterns(cfg) return self @@ -242,11 +245,14 @@ class EntityRuler(object): DOCS: https://spacy.io/api/entityruler#to_bytes """ - serial = OrderedDict(( - ('overwrite', self.overwrite), - ('ent_id_sep', self.ent_id_sep), - ('phrase_matcher_attr', self.phrase_matcher_attr), - ('patterns', self.patterns))) + serial = OrderedDict( + ( + ("overwrite", self.overwrite), + ("ent_id_sep", self.ent_id_sep), + ("phrase_matcher_attr", self.phrase_matcher_attr), + ("patterns", self.patterns), + ) + ) return srsly.msgpack_dumps(serial) def from_disk(self, path, **kwargs): @@ -260,42 +266,52 @@ class EntityRuler(object): DOCS: https://spacy.io/api/entityruler#from_disk """ path = ensure_path(path) - if path.is_file(): - patterns = srsly.read_jsonl(path) + depr_patterns_path = path.with_suffix(".jsonl") + if depr_patterns_path.is_file(): + patterns = srsly.read_jsonl(depr_patterns_path) self.add_patterns(patterns) else: cfg = {} deserializers = { - 'patterns': lambda p: self.add_patterns(srsly.read_jsonl(p.with_suffix('.jsonl'))), - 'cfg': lambda p: cfg.update(srsly.read_json(p)) + "patterns": lambda p: self.add_patterns( + srsly.read_jsonl(p.with_suffix(".jsonl")) + ), + "cfg": lambda p: cfg.update(srsly.read_json(p)), } from_disk(path, deserializers, {}) - self.overwrite = cfg.get('overwrite', False) - self.phrase_matcher_attr = cfg.get('phrase_matcher_attr') - self.ent_id_sep = cfg.get('ent_id_sep', DEFAULT_ENT_ID_SEP) + self.overwrite = cfg.get("overwrite", False) + self.phrase_matcher_attr = cfg.get("phrase_matcher_attr") + self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP) if self.phrase_matcher_attr is not None: - self.phrase_matcher = PhraseMatcher(self.nlp.vocab, - attr=self.phrase_matcher_attr) + self.phrase_matcher = PhraseMatcher( + self.nlp.vocab, attr=self.phrase_matcher_attr + ) return self def to_disk(self, path, **kwargs): 
"""Save the entity ruler patterns to a directory. The patterns will be saved as newline-delimited JSON (JSONL). - path (unicode / Path): The JSONL file to load. + path (unicode / Path): The JSONL file to save. **kwargs: Other config paramters, mostly for consistency. RETURNS (EntityRuler): The loaded entity ruler. DOCS: https://spacy.io/api/entityruler#to_disk """ - cfg = {'overwrite': self.overwrite, - 'phrase_matcher_attr': self.phrase_matcher_attr, - 'ent_id_sep': self.ent_id_sep} - serializers = { - 'patterns': lambda p: srsly.write_jsonl(p.with_suffix('.jsonl'), - self.patterns), - 'cfg': lambda p: srsly.write_json(p, cfg) - } path = ensure_path(path) - to_disk(path, serializers, {}) + cfg = { + "overwrite": self.overwrite, + "phrase_matcher_attr": self.phrase_matcher_attr, + "ent_id_sep": self.ent_id_sep, + } + serializers = { + "patterns": lambda p: srsly.write_jsonl( + p.with_suffix(".jsonl"), self.patterns + ), + "cfg": lambda p: srsly.write_json(p, cfg), + } + if path.suffix == ".jsonl": # user wants to save only JSONL + srsly.write_jsonl(path, self.patterns) + else: + to_disk(path, serializers, {}) diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index 1d4eeadce..891e8d4e3 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -3,16 +3,18 @@ # coding: utf8 from __future__ import unicode_literals -cimport numpy as np - import numpy import srsly +import random from collections import OrderedDict from thinc.api import chain from thinc.v2v import Affine, Maxout, Softmax from thinc.misc import LayerNorm -from thinc.neural.util import to_categorical, copy_array +from thinc.neural.util import to_categorical +from thinc.neural.util import get_array_module +from spacy.kb import KnowledgeBase +from ..cli.pretrain import get_cossim_loss from .functions import merge_subtokens from ..tokens.doc cimport Doc from ..syntax.nn_parser cimport Parser @@ -24,9 +26,9 @@ from ..vocab cimport Vocab from ..syntax import nonproj from ..attrs import POS, ID from ..parts_of_speech import X -from .._ml import Tok2Vec, build_tagger_model +from .._ml import Tok2Vec, build_tagger_model, cosine from .._ml import build_text_classifier, build_simple_cnn_text_classifier -from .._ml import build_bow_text_classifier +from .._ml import build_bow_text_classifier, build_nel_encoder from .._ml import link_vectors_to_models, zero_init, flatten from .._ml import masked_language_model, create_default_optimizer from ..errors import Errors, TempErrors @@ -229,7 +231,7 @@ class Tensorizer(Pipe): vocab (Vocab): A `Vocab` instance. The model must share the same `Vocab` instance with the `Doc` objects it will process. - model (Model): A `Model` instance or `True` allocate one later. + model (Model): A `Model` instance or `True` to allocate one later. **cfg: Config parameters. EXAMPLE: @@ -294,7 +296,7 @@ class Tensorizer(Pipe): docs (iterable): A batch of `Doc` objects. golds (iterable): A batch of `GoldParse` objects. - drop (float): The droput rate. + drop (float): The dropout rate. sgd (callable): An optimizer. RETURNS (dict): Results from the update. """ @@ -386,7 +388,7 @@ class Tagger(Pipe): def predict(self, docs): self.require_model() if not any(len(doc) for doc in docs): - # Handle case where there are no tokens in any docs. + # Handle cases where there are no tokens in any docs. 
n_labels = len(self.labels) guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs] tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO)) @@ -900,6 +902,11 @@ class TextCategorizer(Pipe): def labels(self): return tuple(self.cfg.setdefault("labels", [])) + def require_labels(self): + """Raise an error if the component's model has no labels defined.""" + if not self.labels: + raise ValueError(Errors.E143.format(name=self.name)) + @labels.setter def labels(self, value): self.cfg["labels"] = tuple(value) @@ -929,6 +936,7 @@ class TextCategorizer(Pipe): doc.cats[label] = float(scores[i, j]) def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None): + self.require_model() scores, bp_scores = self.model.begin_update(docs, drop=drop) loss, d_scores = self.get_loss(docs, golds, scores) bp_scores(d_scores, sgd=sgd) @@ -983,6 +991,7 @@ class TextCategorizer(Pipe): def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): if self.model is True: self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") + self.require_labels() self.model = self.Model(len(self.labels), **self.cfg) link_vectors_to_models(self.vocab) if sgd is None: @@ -1001,7 +1010,7 @@ cdef class DependencyParser(Parser): @property def postprocesses(self): - return [nonproj.deprojectivize, merge_subtokens] + return [nonproj.deprojectivize] def add_multitask_objective(self, target): if target == "cloze": @@ -1063,52 +1072,252 @@ cdef class EntityRecognizer(Parser): class EntityLinker(Pipe): + """Pipeline component for named entity linking. + + DOCS: TODO + """ name = 'entity_linker' @classmethod - def Model(cls, nr_class=1, **cfg): - # TODO: non-dummy EL implementation - return None + def Model(cls, **cfg): + embed_width = cfg.get("embed_width", 300) + hidden_width = cfg.get("hidden_width", 128) + type_to_int = cfg.get("type_to_int", dict()) - def __init__(self, model=True, **cfg): - self.model = False + model = build_nel_encoder(embed_width=embed_width, hidden_width=hidden_width, ner_types=len(type_to_int), **cfg) + return model + + def __init__(self, vocab, **cfg): + self.vocab = vocab + self.model = True + self.kb = None self.cfg = dict(cfg) - self.kb = self.cfg["kb"] + self.sgd_context = None + + def set_kb(self, kb): + self.kb = kb + + def require_model(self): + # Raise an error if the component's model is not initialized. + if getattr(self, "model", None) in (None, True, False): + raise ValueError(Errors.E109.format(name=self.name)) + + def require_kb(self): + # Raise an error if the knowledge base is not initialized. 
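        # A rough wiring sketch for this component, kept in comment form; the model
        # name, the 64-dimensional KB, the context_width value and the Q-IDs are
        # assumptions, and it presumes the "entity_linker" factory is available to
        # create_pipe():
        #
        #     nlp = spacy.load("en_core_web_lg")
        #     kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
        #     # ... fill via add_entity()/add_alias(), or kb.load_bulk(path) ...
        #     el_pipe = nlp.create_pipe(name="entity_linker",
        #                               config={"context_width": 128})
        #     el_pipe.set_kb(kb)
        #     nlp.add_pipe(el_pipe, last=True)
        #     el_pipe.begin_training()   # builds the model; entity_width is taken
        #                                # from the KB's entity_vector_length
        #
        # Training examples then pair each Doc with
        # GoldParse(doc, links=[(start_char, end_char, "Q42")]) and go through
        # nlp.update(); without a KB, both update() and predict() fail with E139
        # via the check below.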
+ if getattr(self, "kb", None) in (None, True, False): + raise ValueError(Errors.E139.format(name=self.name)) + + def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): + self.require_kb() + self.cfg["entity_width"] = self.kb.entity_vector_length + + if self.model is True: + self.model = self.Model(**self.cfg) + self.sgd_context = self.create_optimizer() + + if sgd is None: + sgd = self.create_optimizer() + + return sgd + + def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None): + self.require_model() + self.require_kb() + + if losses is not None: + losses.setdefault(self.name, 0.0) + + if not docs or not golds: + return 0 + + if len(docs) != len(golds): + raise ValueError(Errors.E077.format(value="EL training", n_docs=len(docs), + n_golds=len(golds))) + + if isinstance(docs, Doc): + docs = [docs] + golds = [golds] + + context_docs = [] + entity_encodings = [] + cats = [] + priors = [] + type_vectors = [] + + type_to_int = self.cfg.get("type_to_int", dict()) + + for doc, gold in zip(docs, golds): + ents_by_offset = dict() + for ent in doc.ents: + ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] = ent + for entity in gold.links: + start, end, gold_kb = entity + mention = doc.text[start:end] + + gold_ent = ents_by_offset[str(ent.start_char) + "_" + str(ent.end_char)] + assert gold_ent is not None + type_vector = [0 for i in range(len(type_to_int))] + if len(type_to_int) > 0: + type_vector[type_to_int[gold_ent.label_]] = 1 + + candidates = self.kb.get_candidates(mention) + random.shuffle(candidates) + nr_neg = 0 + for c in candidates: + kb_id = c.entity_ + entity_encoding = c.entity_vector + entity_encodings.append(entity_encoding) + context_docs.append(doc) + type_vectors.append(type_vector) + + if self.cfg.get("prior_weight", 1) > 0: + priors.append([c.prior_prob]) + else: + priors.append([0]) + + if kb_id == gold_kb: + cats.append([1]) + else: + nr_neg += 1 + cats.append([0]) + + if len(entity_encodings) > 0: + assert len(priors) == len(entity_encodings) == len(context_docs) == len(cats) == len(type_vectors) + + context_encodings, bp_context = self.model.tok2vec.begin_update(context_docs, drop=drop) + entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32") + + mention_encodings = [list(context_encodings[i]) + list(entity_encodings[i]) + priors[i] + type_vectors[i] + for i in range(len(entity_encodings))] + pred, bp_mention = self.model.begin_update(self.model.ops.asarray(mention_encodings, dtype="float32"), drop=drop) + cats = self.model.ops.asarray(cats, dtype="float32") + + loss, d_scores = self.get_loss(prediction=pred, golds=cats, docs=None) + mention_gradient = bp_mention(d_scores, sgd=sgd) + + context_gradients = [list(x[0:self.cfg.get("context_width")]) for x in mention_gradient] + bp_context(self.model.ops.asarray(context_gradients, dtype="float32"), sgd=self.sgd_context) + + if losses is not None: + losses[self.name] += loss + return loss + return 0 + + def get_loss(self, docs, golds, prediction): + d_scores = (prediction - golds) + loss = (d_scores ** 2).sum() + loss = loss / len(golds) + return loss, d_scores + + def get_loss_old(self, docs, golds, scores): + # this loss function assumes we're only using positive examples + loss, gradients = get_cossim_loss(yh=scores, y=golds) + loss = loss / len(golds) + return loss, gradients def __call__(self, doc): - self.set_annotations([doc], scores=None, tensors=None) + entities, kb_ids = self.predict([doc]) + self.set_annotations([doc], entities, 
kb_ids) return doc def pipe(self, stream, batch_size=128, n_threads=-1): - """Apply the pipe to a stream of documents. - Both __call__ and pipe should delegate to the `predict()` - and `set_annotations()` methods. - """ for docs in util.minibatch(stream, size=batch_size): docs = list(docs) - self.set_annotations(docs, scores=None, tensors=None) + entities, kb_ids = self.predict(docs) + self.set_annotations(docs, entities, kb_ids) yield from docs - def set_annotations(self, docs, scores, tensors=None): - """ - Currently implemented as taking the KB entry with highest prior probability for each named entity - TODO: actually use context etc - """ - for i, doc in enumerate(docs): - for ent in doc.ents: - candidates = self.kb.get_candidates(ent.text) - if candidates: - best_candidate = max(candidates, key=lambda c: c.prior_prob) - for token in ent: - token.ent_kb_id_ = best_candidate.entity_ + def predict(self, docs): + self.require_model() + self.require_kb() - def get_loss(self, docs, golds, scores): - # TODO - pass + final_entities = [] + final_kb_ids = [] + + if not docs: + return final_entities, final_kb_ids + + if isinstance(docs, Doc): + docs = [docs] + + context_encodings = self.model.tok2vec(docs) + xp = get_array_module(context_encodings) + + type_to_int = self.cfg.get("type_to_int", dict()) + + for i, doc in enumerate(docs): + if len(doc) > 0: + context_encoding = context_encodings[i] + for ent in doc.ents: + type_vector = [0 for i in range(len(type_to_int))] + if len(type_to_int) > 0: + type_vector[type_to_int[ent.label_]] = 1 + + candidates = self.kb.get_candidates(ent.text) + if candidates: + random.shuffle(candidates) + + # this will set the prior probabilities to 0 (just like in training) if their weight is 0 + prior_probs = xp.asarray([[c.prior_prob] for c in candidates]) + prior_probs *= self.cfg.get("prior_weight", 1) + scores = prior_probs + + if self.cfg.get("context_weight", 1) > 0: + entity_encodings = xp.asarray([c.entity_vector for c in candidates]) + assert len(entity_encodings) == len(prior_probs) + mention_encodings = [list(context_encoding) + list(entity_encodings[i]) + + list(prior_probs[i]) + type_vector + for i in range(len(entity_encodings))] + scores = self.model(self.model.ops.asarray(mention_encodings, dtype="float32")) + + # TODO: thresholding + best_index = scores.argmax() + best_candidate = candidates[best_index] + final_entities.append(ent) + final_kb_ids.append(best_candidate.entity_) + + return final_entities, final_kb_ids + + def set_annotations(self, docs, entities, kb_ids=None): + for entity, kb_id in zip(entities, kb_ids): + for token in entity: + token.ent_kb_id_ = kb_id + + def to_disk(self, path, exclude=tuple(), **kwargs): + serialize = OrderedDict() + serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) + serialize["vocab"] = lambda p: self.vocab.to_disk(p) + serialize["kb"] = lambda p: self.kb.dump(p) + if self.model not in (None, True, False): + serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes()) + exclude = util.get_serialization_exclude(serialize, exclude, kwargs) + util.to_disk(path, serialize, exclude) + + def from_disk(self, path, exclude=tuple(), **kwargs): + def load_model(p): + if self.model is True: + self.model = self.Model(**self.cfg) + self.model.from_bytes(p.open("rb").read()) + + def load_kb(p): + kb = KnowledgeBase(vocab=self.vocab, entity_vector_length=self.cfg["entity_width"]) + kb.load_bulk(p) + self.set_kb(kb) + + deserialize = OrderedDict() + deserialize["cfg"] = lambda p: 
self.cfg.update(_load_cfg(p)) + deserialize["vocab"] = lambda p: self.vocab.from_disk(p) + deserialize["kb"] = load_kb + deserialize["model"] = load_model + exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) + util.from_disk(path, deserialize, exclude) + return self + + def rehearse(self, docs, sgd=None, losses=None, **config): + raise NotImplementedError def add_label(self, label): - # TODO - pass + raise NotImplementedError class Sentencizer(object): diff --git a/spacy/scorer.py b/spacy/scorer.py index 32716b852..b9994e3f2 100644 --- a/spacy/scorer.py +++ b/spacy/scorer.py @@ -52,6 +52,7 @@ class Scorer(object): self.labelled = PRFScore() self.tags = PRFScore() self.ner = PRFScore() + self.ner_per_ents = dict() self.eval_punct = eval_punct @property @@ -91,6 +92,15 @@ class Scorer(object): """RETURNS (float): Named entity accuracy (F-score).""" return self.ner.fscore * 100 + @property + def ents_per_type(self): + """RETURNS (dict): Scores per entity label. + """ + return { + k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100} + for k, v in self.ner_per_ents.items() + } + @property def scores(self): """RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`, @@ -102,6 +112,7 @@ class Scorer(object): "ents_p": self.ents_p, "ents_r": self.ents_r, "ents_f": self.ents_f, + "ents_per_type": self.ents_per_type, "tags_acc": self.tags_acc, "token_acc": self.token_acc, } @@ -149,13 +160,31 @@ class Scorer(object): cand_deps.add((gold_i, gold_head, token.dep_.lower())) if "-" not in [token[-1] for token in gold.orig_annot]: cand_ents = set() + current_ent = {k.label_: set() for k in doc.ents} + current_gold = {k.label_: set() for k in doc.ents} for ent in doc.ents: + if ent.label_ not in self.ner_per_ents: + self.ner_per_ents[ent.label_] = PRFScore() first = gold.cand_to_gold[ent.start] last = gold.cand_to_gold[ent.end - 1] if first is None or last is None: self.ner.fp += 1 + self.ner_per_ents[ent.label_].fp += 1 else: cand_ents.add((ent.label_, first, last)) + current_ent[ent.label_].add( + tuple(x for x in cand_ents if x[0] == ent.label_) + ) + current_gold[ent.label_].add( + tuple(x for x in gold_ents if x[0] == ent.label_) + ) + # Scores per ent + [ + v.score_set(current_ent[k], current_gold[k]) + for k, v in self.ner_per_ents.items() + if k in current_ent + ] + # Score for all ents self.ner.score_set(cand_ents, gold_ents) self.tags.score_set(cand_tags, gold_tags) self.labelled.score_set(cand_deps, gold_deps) diff --git a/spacy/structs.pxd b/spacy/structs.pxd index 154202c0d..e80b1b4d6 100644 --- a/spacy/structs.pxd +++ b/spacy/structs.pxd @@ -3,6 +3,10 @@ from libc.stdint cimport uint8_t, uint32_t, int32_t, uint64_t from .typedefs cimport flags_t, attr_t, hash_t from .parts_of_speech cimport univ_pos_t +from libcpp.vector cimport vector +from libc.stdint cimport int32_t, int64_t + + cdef struct LexemeC: flags_t flags @@ -72,3 +76,32 @@ cdef struct TokenC: attr_t ent_type # TODO: Is there a better way to do this? Multiple sources of truth.. attr_t ent_kb_id hash_t ent_id + + +# Internal struct, for storage and disambiguation of entities. +cdef struct KBEntryC: + + # The hash of this entry's unique ID/name in the kB + hash_t entity_hash + + # Allows retrieval of the entity vector, as an index into a vectors table of the KB. + # Can be expanded later to refer to multiple rows (compositional model to reduce storage footprint). + int32_t vector_index + + # Allows retrieval of a struct of non-vector features. 
+ # This is currently not implemented and set to -1 for the common case where there are no features. + int32_t feats_row + + # log probability of entity, based on corpus frequency + float prob + + +# Each alias struct stores a list of Entry pointers with their prior probabilities +# for this specific mention/alias. +cdef struct AliasC: + + # All entry candidates for this alias + vector[int64_t] entry_indices + + # Prior probability P(entity|alias) - should sum up to (at most) 1. + vector[float] probs diff --git a/spacy/symbols.pxd b/spacy/symbols.pxd index 051b92edb..5922ee588 100644 --- a/spacy/symbols.pxd +++ b/spacy/symbols.pxd @@ -460,3 +460,5 @@ cdef enum symbol_t: xcomp acl + + ENT_KB_ID diff --git a/spacy/symbols.pyx b/spacy/symbols.pyx index 949621820..b65ae9628 100644 --- a/spacy/symbols.pyx +++ b/spacy/symbols.pyx @@ -86,6 +86,7 @@ IDS = { "DEP": DEP, "ENT_IOB": ENT_IOB, "ENT_TYPE": ENT_TYPE, + "ENT_KB_ID": ENT_KB_ID, "HEAD": HEAD, "SENT_START": SENT_START, "SPACY": SPACY, diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 4bef85a1b..fdd86616d 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -124,6 +124,12 @@ def ja_tokenizer(): return get_lang_class("ja").Defaults.create_tokenizer() +@pytest.fixture(scope="session") +def ko_tokenizer(): + pytest.importorskip("natto") + return get_lang_class("ko").Defaults.create_tokenizer() + + @pytest.fixture(scope="session") def lt_tokenizer(): return get_lang_class("lt").Defaults.create_tokenizer() diff --git a/spacy/tests/lang/ko/__init__.py b/spacy/tests/lang/ko/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/ko/test_lemmatization.py b/spacy/tests/lang/ko/test_lemmatization.py new file mode 100644 index 000000000..42c306c11 --- /dev/null +++ b/spacy/tests/lang/ko/test_lemmatization.py @@ -0,0 +1,12 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +@pytest.mark.parametrize( + "word,lemma", [("새로운", "새롭"), ("빨간", "빨갛"), ("클수록", "크"), ("뭡니까", "뭣"), ("됐다", "되")] +) +def test_ko_lemmatizer_assigns(ko_tokenizer, word, lemma): + test_lemma = ko_tokenizer(word)[0].lemma_ + assert test_lemma == lemma diff --git a/spacy/tests/lang/ko/test_tokenizer.py b/spacy/tests/lang/ko/test_tokenizer.py new file mode 100644 index 000000000..cc7b5fd77 --- /dev/null +++ b/spacy/tests/lang/ko/test_tokenizer.py @@ -0,0 +1,46 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + +# fmt: off +TOKENIZER_TESTS = [("서울 타워 근처에 살고 있습니다.", "서울 타워 근처 에 살 고 있 습니다 ."), + ("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 .")] + +TAG_TESTS = [("서울 타워 근처에 살고 있습니다.", + "NNP NNG NNG JKB VV EC VX EF SF"), + ("영등포구에 있는 맛집 좀 알려주세요.", + "NNP JKB VV ETM NNG MAG VV VX EP SF")] + +FULL_TAG_TESTS = [("영등포구에 있는 맛집 좀 알려주세요.", + "NNP JKB VV ETM NNG MAG VV+EC VX EP+EF SF")] + +POS_TESTS = [("서울 타워 근처에 살고 있습니다.", + "PROPN NOUN NOUN ADP VERB X AUX X PUNCT"), + ("영등포구에 있는 맛집 좀 알려주세요.", + "PROPN ADP VERB X NOUN ADV VERB AUX X PUNCT")] +# fmt: on + + +@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS) +def test_ko_tokenizer(ko_tokenizer, text, expected_tokens): + tokens = [token.text for token in ko_tokenizer(text)] + assert tokens == expected_tokens.split() + + +@pytest.mark.parametrize("text,expected_tags", TAG_TESTS) +def test_ko_tokenizer_tags(ko_tokenizer, text, expected_tags): + tags = [token.tag_ for token in ko_tokenizer(text)] + assert tags == expected_tags.split() + + +@pytest.mark.parametrize("text,expected_tags", FULL_TAG_TESTS) +def 
test_ko_tokenizer_full_tags(ko_tokenizer, text, expected_tags): + tags = ko_tokenizer(text).user_data["full_tags"] + assert tags == expected_tags.split() + + +@pytest.mark.parametrize("text,expected_pos", POS_TESTS) +def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos): + pos = [token.pos_ for token in ko_tokenizer(text)] + assert pos == expected_pos.split() diff --git a/spacy/tests/lang/lt/test_text.py b/spacy/tests/lang/lt/test_text.py index d2550067b..cac32aa4d 100644 --- a/spacy/tests/lang/lt/test_text.py +++ b/spacy/tests/lang/lt/test_text.py @@ -5,16 +5,24 @@ import pytest def test_lt_tokenizer_handles_long_text(lt_tokenizer): - text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią -vidutinį daugiametį vandens lygį. Nustatyta, kad iš 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis -yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui.""" - tokens = lt_tokenizer(text.replace("\n", "")) + text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią vidutinį daugiametį vandens lygį. Nustatyta, kad iš 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui.""" + tokens = lt_tokenizer(text) assert len(tokens) == 42 -@pytest.mark.parametrize('text,length', [ - ("177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.", 15), - ("ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.", 16)]) +@pytest.mark.parametrize( + "text,length", + [ + ( + "177R Parodų rūmai–Ozo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.", + 15, + ), + ( + "ISM universiteto doc. dr. 
Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.", + 16, + ), + ], +) def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length): tokens = lt_tokenizer(text) assert len(tokens) == length @@ -26,18 +34,22 @@ def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text): assert len(tokens) == 1 -@pytest.mark.parametrize("text,match", [ - ("10", True), - ("1", True), - ("10,000", True), - ("10,00", True), - ("999.0", True), - ("vienas", True), - ("du", True), - ("milijardas", True), - ("šuo", False), - (",", False), - ("1/2", True)]) +@pytest.mark.parametrize( + "text,match", + [ + ("10", True), + ("1", True), + ("10,000", True), + ("10,00", True), + ("999.0", True), + ("vienas", True), + ("du", True), + ("milijardas", True), + ("šuo", False), + (",", False), + ("1/2", True), + ], +) def test_lt_lex_attrs_like_number(lt_tokenizer, text, match): tokens = lt_tokenizer(text) assert len(tokens) == 1 diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index 54ddd6789..013700d52 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -5,7 +5,6 @@ import pytest import re from spacy.matcher import Matcher, DependencyMatcher from spacy.tokens import Doc, Token -from ..util import get_doc @pytest.fixture @@ -288,24 +287,43 @@ def deps(): def dependency_matcher(en_vocab): def is_brown_yellow(text): return bool(re.compile(r"brown|yellow|over").match(text)) + IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow) pattern1 = [ {"SPEC": {"NODE_NAME": "fox"}, "PATTERN": {"ORTH": "fox"}}, - {"SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"},"PATTERN": {"ORTH": "quick", "DEP": "amod"}}, - {"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, "PATTERN": {IS_BROWN_YELLOW: True}}, + { + "SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, + "PATTERN": {"ORTH": "quick", "DEP": "amod"}, + }, + { + "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, + "PATTERN": {IS_BROWN_YELLOW: True}, + }, ] pattern2 = [ {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}, - {"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}} + { + "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, + "PATTERN": {"ORTH": "fox"}, + }, + { + "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, + "PATTERN": {"ORTH": "fox"}, + }, ] pattern3 = [ {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}, - {"SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"}, "PATTERN": {"ORTH": "brown"}} + { + "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, + "PATTERN": {"ORTH": "fox"}, + }, + { + "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"}, + "PATTERN": {"ORTH": "brown"}, + }, ] matcher = DependencyMatcher(en_vocab) @@ -320,9 +338,9 @@ def test_dependency_matcher_compile(dependency_matcher): assert len(dependency_matcher) == 3 -def test_dependency_matcher(dependency_matcher, text, heads, deps): - doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps) - matches = dependency_matcher(doc) - # assert matches[0][1] == [[3, 1, 2]] - # assert matches[1][1] == [[4, 3, 3]] - # assert matches[2][1] 
== [[4, 3, 2]] +# def test_dependency_matcher(dependency_matcher, text, heads, deps): +# doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps) +# matches = dependency_matcher(doc) +# assert matches[0][1] == [[3, 1, 2]] +# assert matches[1][1] == [[4, 3, 3]] +# assert matches[2][1] == [[4, 3, 2]] diff --git a/spacy/tests/pipeline/test_el.py b/spacy/tests/pipeline/test_el.py deleted file mode 100644 index 61baece68..000000000 --- a/spacy/tests/pipeline/test_el.py +++ /dev/null @@ -1,91 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - -from spacy.kb import KnowledgeBase -from spacy.lang.en import English - - -@pytest.fixture -def nlp(): - return English() - - -def test_kb_valid_entities(nlp): - """Test the valid construction of a KB with 3 entities and two aliases""" - mykb = KnowledgeBase(nlp.vocab) - - # adding entities - mykb.add_entity(entity=u'Q1', prob=0.9) - mykb.add_entity(entity=u'Q2') - mykb.add_entity(entity=u'Q3', prob=0.5) - - # adding aliases - mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2]) - mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9]) - - # test the size of the corresponding KB - assert(mykb.get_size_entities() == 3) - assert(mykb.get_size_aliases() == 2) - - -def test_kb_invalid_entities(nlp): - """Test the invalid construction of a KB with an alias linked to a non-existing entity""" - mykb = KnowledgeBase(nlp.vocab) - - # adding entities - mykb.add_entity(entity=u'Q1', prob=0.9) - mykb.add_entity(entity=u'Q2', prob=0.2) - mykb.add_entity(entity=u'Q3', prob=0.5) - - # adding aliases - should fail because one of the given IDs is not valid - with pytest.raises(ValueError): - mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q342'], probabilities=[0.8, 0.2]) - - -def test_kb_invalid_probabilities(nlp): - """Test the invalid construction of a KB with wrong prior probabilities""" - mykb = KnowledgeBase(nlp.vocab) - - # adding entities - mykb.add_entity(entity=u'Q1', prob=0.9) - mykb.add_entity(entity=u'Q2', prob=0.2) - mykb.add_entity(entity=u'Q3', prob=0.5) - - # adding aliases - should fail because the sum of the probabilities exceeds 1 - with pytest.raises(ValueError): - mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.4]) - - -def test_kb_invalid_combination(nlp): - """Test the invalid construction of a KB with non-matching entity and probability lists""" - mykb = KnowledgeBase(nlp.vocab) - - # adding entities - mykb.add_entity(entity=u'Q1', prob=0.9) - mykb.add_entity(entity=u'Q2', prob=0.2) - mykb.add_entity(entity=u'Q3', prob=0.5) - - # adding aliases - should fail because the entities and probabilities vectors are not of equal length - with pytest.raises(ValueError): - mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.3, 0.4, 0.1]) - - -def test_candidate_generation(nlp): - """Test correct candidate generation""" - mykb = KnowledgeBase(nlp.vocab) - - # adding entities - mykb.add_entity(entity=u'Q1', prob=0.9) - mykb.add_entity(entity=u'Q2', prob=0.2) - mykb.add_entity(entity=u'Q3', prob=0.5) - - # adding aliases - mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2]) - mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9]) - - # test the size of the relevant candidates - assert(len(mykb.get_candidates(u'douglas')) == 2) - assert(len(mykb.get_candidates(u'adam')) == 1) - assert(len(mykb.get_candidates(u'shrubbery')) == 0) diff --git 
a/spacy/tests/pipeline/test_entity_linker.py b/spacy/tests/pipeline/test_entity_linker.py new file mode 100644 index 000000000..cafc380ba --- /dev/null +++ b/spacy/tests/pipeline/test_entity_linker.py @@ -0,0 +1,145 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + +from spacy.kb import KnowledgeBase +from spacy.lang.en import English +from spacy.pipeline import EntityRuler + + +@pytest.fixture +def nlp(): + return English() + + +def test_kb_valid_entities(nlp): + """Test the valid construction of a KB with 3 entities and two aliases""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + + # adding entities + mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1]) + mykb.add_entity(entity='Q2', prob=0.5, entity_vector=[2]) + mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3]) + + # adding aliases + mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.2]) + mykb.add_alias(alias='adam', entities=['Q2'], probabilities=[0.9]) + + # test the size of the corresponding KB + assert(mykb.get_size_entities() == 3) + assert(mykb.get_size_aliases() == 2) + + +def test_kb_invalid_entities(nlp): + """Test the invalid construction of a KB with an alias linked to a non-existing entity""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + + # adding entities + mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1]) + mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2]) + mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3]) + + # adding aliases - should fail because one of the given IDs is not valid + with pytest.raises(ValueError): + mykb.add_alias(alias='douglas', entities=['Q2', 'Q342'], probabilities=[0.8, 0.2]) + + +def test_kb_invalid_probabilities(nlp): + """Test the invalid construction of a KB with wrong prior probabilities""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + + # adding entities + mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1]) + mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2]) + mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3]) + + # adding aliases - should fail because the sum of the probabilities exceeds 1 + with pytest.raises(ValueError): + mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.4]) + + +def test_kb_invalid_combination(nlp): + """Test the invalid construction of a KB with non-matching entity and probability lists""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + + # adding entities + mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1]) + mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2]) + mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3]) + + # adding aliases - should fail because the entities and probabilities vectors are not of equal length + with pytest.raises(ValueError): + mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.3, 0.4, 0.1]) + + +def test_kb_invalid_entity_vector(nlp): + """Test the invalid construction of a KB with non-matching entity vector lengths""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=3) + + # adding entities + mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1, 2, 3]) + + # this should fail because the kb's expected entity vector length is 3 + with pytest.raises(ValueError): + mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2]) + + +def test_candidate_generation(nlp): + """Test correct candidate generation""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + + # adding entities + 
mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1]) + mykb.add_entity(entity='Q2', prob=0.2, entity_vector=[2]) + mykb.add_entity(entity='Q3', prob=0.5, entity_vector=[3]) + + # adding aliases + mykb.add_alias(alias='douglas', entities=['Q2', 'Q3'], probabilities=[0.8, 0.2]) + mykb.add_alias(alias='adam', entities=['Q2'], probabilities=[0.9]) + + # test the size of the relevant candidates + assert(len(mykb.get_candidates('douglas')) == 2) + assert(len(mykb.get_candidates('adam')) == 1) + assert(len(mykb.get_candidates('shrubbery')) == 0) + + +def test_preserving_links_asdoc(nlp): + """Test that Span.as_doc preserves the existing entity links""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + + # adding entities + mykb.add_entity(entity='Q1', prob=0.9, entity_vector=[1]) + mykb.add_entity(entity='Q2', prob=0.8, entity_vector=[1]) + + # adding aliases + mykb.add_alias(alias='Boston', entities=['Q1'], probabilities=[0.7]) + mykb.add_alias(alias='Denver', entities=['Q2'], probabilities=[0.6]) + + # set up pipeline with NER (Entity Ruler) and NEL (prior probability only, model not trained) + sentencizer = nlp.create_pipe("sentencizer") + nlp.add_pipe(sentencizer) + + ruler = EntityRuler(nlp) + patterns = [{"label": "GPE", "pattern": "Boston"}, + {"label": "GPE", "pattern": "Denver"}] + ruler.add_patterns(patterns) + nlp.add_pipe(ruler) + + el_pipe = nlp.create_pipe(name='entity_linker', config={"context_width": 64}) + el_pipe.set_kb(mykb) + el_pipe.begin_training() + el_pipe.context_weight = 0 + el_pipe.prior_weight = 1 + nlp.add_pipe(el_pipe, last=True) + + # test whether the entity links are preserved by the `as_doc()` function + text = "She lives in Boston. He lives in Denver." + doc = nlp(text) + for ent in doc.ents: + orig_text = ent.text + orig_kb_id = ent.kb_id_ + sent_doc = ent.sent.as_doc() + for s_ent in sent_doc.ents: + if s_ent.text == orig_text: + assert s_ent.kb_id_ == orig_kb_id diff --git a/spacy/tests/pipeline/test_entity_ruler.py b/spacy/tests/pipeline/test_entity_ruler.py index a371be38b..5ab1a3af0 100644 --- a/spacy/tests/pipeline/test_entity_ruler.py +++ b/spacy/tests/pipeline/test_entity_ruler.py @@ -111,7 +111,7 @@ def test_entity_ruler_serialize_bytes(nlp, patterns): assert len(new_ruler.patterns) == len(ruler.patterns) for pattern in ruler.patterns: assert pattern in new_ruler.patterns - assert new_ruler.labels == ruler.labels + assert sorted(new_ruler.labels) == sorted(ruler.labels) def test_entity_ruler_serialize_phrase_matcher_attr_bytes(nlp, patterns): diff --git a/spacy/tests/regression/test_issue2001-2500.py b/spacy/tests/regression/test_issue2001-2500.py index 82b3a81a9..4292c8d23 100644 --- a/spacy/tests/regression/test_issue2001-2500.py +++ b/spacy/tests/regression/test_issue2001-2500.py @@ -4,6 +4,7 @@ from __future__ import unicode_literals import pytest import numpy from spacy.tokens import Doc +from spacy.matcher import Matcher from spacy.displacy import render from spacy.gold import iob_to_biluo from spacy.lang.it import Italian @@ -123,6 +124,15 @@ def test_issue2396(en_vocab): assert (span.get_lca_matrix() == matrix).all() +def test_issue2464(en_vocab): + """Test problem with successive ?. 
This is the same bug, so putting it here.""" + matcher = Matcher(en_vocab) + doc = Doc(en_vocab, words=["a", "b"]) + matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}]) + matches = matcher(doc) + assert len(matches) == 3 + + def test_issue2482(): """Test we can serialize and deserialize a blank NER or parser model.""" nlp = Italian() diff --git a/spacy/tests/regression/test_issue3001-3500.py b/spacy/tests/regression/test_issue3001-3500.py new file mode 100644 index 000000000..3b0c2f1ed --- /dev/null +++ b/spacy/tests/regression/test_issue3001-3500.py @@ -0,0 +1,334 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest +from spacy.lang.en import English +from spacy.lang.de import German +from spacy.pipeline import EntityRuler, EntityRecognizer +from spacy.matcher import Matcher, PhraseMatcher +from spacy.tokens import Doc +from spacy.vocab import Vocab +from spacy.attrs import ENT_IOB, ENT_TYPE +from spacy.compat import pickle, is_python2, unescape_unicode +from spacy import displacy +from spacy.util import decaying +import numpy +import re + +from ..util import get_doc + + +def test_issue3002(): + """Test that the tokenizer doesn't hang on a long list of dots""" + nlp = German() + doc = nlp( + "880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl" + ) + assert len(doc) == 5 + + +def test_issue3009(en_vocab): + """Test problem with matcher quantifiers""" + patterns = [ + [{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}], + [ + {"LEMMA": "have"}, + {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"}, + {"LOWER": "to"}, + {"LOWER": "do"}, + {"POS": "ADP"}, + ], + [ + {"LEMMA": "have"}, + {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"}, + {"LOWER": "to"}, + {"LOWER": "do"}, + {"POS": "ADP"}, + ], + ] + words = ["also", "has", "to", "do", "with"] + tags = ["RB", "VBZ", "TO", "VB", "IN"] + doc = get_doc(en_vocab, words=words, tags=tags) + matcher = Matcher(en_vocab) + for i, pattern in enumerate(patterns): + matcher.add(str(i), None, pattern) + matches = matcher(doc) + assert matches + + +def test_issue3012(en_vocab): + """Test that the is_tagged attribute doesn't get overwritten when we from_array + without tag information.""" + words = ["This", "is", "10", "%", "."] + tags = ["DT", "VBZ", "CD", "NN", "."] + pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"] + ents = [(2, 4, "PERCENT")] + doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents) + assert doc.is_tagged + + expected = ("10", "NUM", "CD", "PERCENT") + assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected + + header = [ENT_IOB, ENT_TYPE] + ent_array = doc.to_array(header) + doc.from_array(header, ent_array) + + assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected + + # Serializing then deserializing + doc_bytes = doc.to_bytes() + doc2 = Doc(en_vocab).from_bytes(doc_bytes) + assert (doc2[2].text, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_) == expected + + +def test_issue3199(): + """Test that Span.noun_chunks works correctly if no noun chunks iterator + is available. To make this test future-proof, we're constructing a Doc + with a new Vocab here and setting is_parsed to make sure the noun chunks run. 
+ """ + doc = Doc(Vocab(), words=["This", "is", "a", "sentence"]) + doc.is_parsed = True + assert list(doc[0:3].noun_chunks) == [] + + +def test_issue3209(): + """Test issue that occurred in spaCy nightly where NER labels were being + mapped to classes incorrectly after loading the model, when the labels + were added using ner.add_label(). + """ + nlp = English() + ner = nlp.create_pipe("ner") + nlp.add_pipe(ner) + + ner.add_label("ANIMAL") + nlp.begin_training() + move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"] + assert ner.move_names == move_names + nlp2 = English() + nlp2.add_pipe(nlp2.create_pipe("ner")) + nlp2.from_bytes(nlp.to_bytes()) + assert nlp2.get_pipe("ner").move_names == move_names + + +def test_issue3248_1(): + """Test that the PhraseMatcher correctly reports its number of rules, not + total number of patterns.""" + nlp = English() + matcher = PhraseMatcher(nlp.vocab) + matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) + matcher.add("TEST2", None, nlp("d")) + assert len(matcher) == 2 + + +def test_issue3248_2(): + """Test that the PhraseMatcher can be pickled correctly.""" + nlp = English() + matcher = PhraseMatcher(nlp.vocab) + matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) + matcher.add("TEST2", None, nlp("d")) + data = pickle.dumps(matcher) + new_matcher = pickle.loads(data) + assert len(new_matcher) == len(matcher) + + +def test_issue3277(es_tokenizer): + """Test that hyphens are split correctly as prefixes.""" + doc = es_tokenizer("—Yo me llamo... –murmuró el niño– Emilio Sánchez Pérez.") + assert len(doc) == 14 + assert doc[0].text == "\u2014" + assert doc[5].text == "\u2013" + assert doc[9].text == "\u2013" + + +def test_issue3288(en_vocab): + """Test that retokenization works correctly via displaCy when punctuation + is merged onto the preceeding token and tensor is resized.""" + words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"] + heads = [1, 0, -1, 1, 0, 1, -2, -3] + deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"] + doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) + doc.tensor = numpy.zeros((len(words), 96), dtype="float32") + displacy.render(doc) + + +def test_issue3289(): + """Test that Language.to_bytes handles serializing a pipeline component + with an uninitialized model.""" + nlp = English() + nlp.add_pipe(nlp.create_pipe("textcat")) + bytes_data = nlp.to_bytes() + new_nlp = English() + new_nlp.add_pipe(nlp.create_pipe("textcat")) + new_nlp.from_bytes(bytes_data) + + +def test_issue3328(en_vocab): + doc = Doc(en_vocab, words=["Hello", ",", "how", "are", "you", "doing", "?"]) + matcher = Matcher(en_vocab) + patterns = [ + [{"LOWER": {"IN": ["hello", "how"]}}], + [{"LOWER": {"IN": ["you", "doing"]}}], + ] + matcher.add("TEST", None, *patterns) + matches = matcher(doc) + assert len(matches) == 4 + matched_texts = [doc[start:end].text for _, start, end in matches] + assert matched_texts == ["Hello", "how", "you", "doing"] + + +@pytest.mark.xfail +def test_issue3331(en_vocab): + """Test that duplicate patterns for different rules result in multiple + matches, one per rule. 
+ """ + matcher = PhraseMatcher(en_vocab) + matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"])) + matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"])) + doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"]) + matches = matcher(doc) + assert len(matches) == 2 + match_ids = [en_vocab.strings[matches[0][0]], en_vocab.strings[matches[1][0]]] + assert sorted(match_ids) == ["A", "B"] + + +def test_issue3345(): + """Test case where preset entity crosses sentence boundary.""" + nlp = English() + doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"]) + doc[4].is_sent_start = True + ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}]) + ner = EntityRecognizer(doc.vocab) + # Add the OUT action. I wouldn't have thought this would be necessary... + ner.moves.add_action(5, "") + ner.add_label("GPE") + doc = ruler(doc) + # Get into the state just before "New" + state = ner.moves.init_batch([doc])[0] + ner.moves.apply_transition(state, "O") + ner.moves.apply_transition(state, "O") + ner.moves.apply_transition(state, "O") + # Check that B-GPE is valid. + assert ner.moves.is_valid(state, "B-GPE") + + +if is_python2: + # If we have this test in Python 3, pytest chokes, as it can't print the + # string above in the xpass message. + prefix_search = ( + b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])" + b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?" + b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}" + b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|" + b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|" + b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|" + b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|" + b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|" + b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|" + b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|" + b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|" + b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|" + b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|" + b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|" + b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F" + b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8" + b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17" + b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC" + b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940" + b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103" + b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125" + b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F" + b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4" + b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5" + b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B" + b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440" + b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2" + b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800" + b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76" + b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80" + b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004" + b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191" + b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250" + 
b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0" + b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77" + b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137" + b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E" + b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877" + b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45" + b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129" + b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C" + b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245" + b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A" + b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86" + b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0" + b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1" + b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6" + b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250" + b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400" + b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700" + b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810" + b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890" + b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940" + b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2" + b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF" + b"\\U0001FA60-\\U0001FA6D]" + ) + + def test_issue3356(): + pattern = re.compile(unescape_unicode(prefix_search.decode("utf8"))) + assert not pattern.search("hello") + + +def test_issue3410(): + texts = ["Hello world", "This is a test"] + nlp = English() + matcher = Matcher(nlp.vocab) + phrasematcher = PhraseMatcher(nlp.vocab) + with pytest.deprecated_call(): + docs = list(nlp.pipe(texts, n_threads=4)) + with pytest.deprecated_call(): + docs = list(nlp.tokenizer.pipe(texts, n_threads=4)) + with pytest.deprecated_call(): + list(matcher.pipe(docs, n_threads=4)) + with pytest.deprecated_call(): + list(phrasematcher.pipe(docs, n_threads=4)) + + +def test_issue3447(): + sizes = decaying(10.0, 1.0, 0.5) + size = next(sizes) + assert size == 10.0 + size = next(sizes) + assert size == 10.0 - 0.5 + size = next(sizes) + assert size == 10.0 - 0.5 - 0.5 + + +@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot") +def test_issue3449(): + nlp = English() + nlp.add_pipe(nlp.create_pipe("sentencizer")) + text1 = "He gave the ball to I. Do you want to go to the movies with I?" + text2 = "He gave the ball to I. Do you want to go to the movies with I?" + text3 = "He gave the ball to I.\nDo you want to go to the movies with I?" 
+ t1 = nlp(text1) + t2 = nlp(text2) + t3 = nlp(text3) + assert t1[5].text == "I" + assert t2[5].text == "I" + assert t3[5].text == "I" + + +def test_issue3468(): + """Test that sentence boundaries are set correctly so Doc.is_sentenced can + be restored after serialization.""" + nlp = English() + nlp.add_pipe(nlp.create_pipe("sentencizer")) + doc = nlp("Hello world") + assert doc[0].is_sent_start + assert doc.is_sentenced + assert len(list(doc.sents)) == 1 + doc_bytes = doc.to_bytes() + new_doc = Doc(nlp.vocab).from_bytes(doc_bytes) + assert new_doc[0].is_sent_start + assert new_doc.is_sentenced + assert len(list(new_doc.sents)) == 1 diff --git a/spacy/tests/regression/test_issue3002.py b/spacy/tests/regression/test_issue3002.py deleted file mode 100644 index 54e661d1f..000000000 --- a/spacy/tests/regression/test_issue3002.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.de import German - - -def test_issue3002(): - """Test that the tokenizer doesn't hang on a long list of dots""" - nlp = German() - doc = nlp('880.794.982.218.444.893.023.439.794.626.120.190.780.624.990.275.671 ist eine lange Zahl') - assert len(doc) == 5 diff --git a/spacy/tests/regression/test_issue3009.py b/spacy/tests/regression/test_issue3009.py deleted file mode 100644 index 25f208903..000000000 --- a/spacy/tests/regression/test_issue3009.py +++ /dev/null @@ -1,67 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest -from spacy.matcher import Matcher -from spacy.tokens import Doc - - -PATTERNS = [ - ("1", [[{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}]]), - ( - "2", - [ - [ - {"LEMMA": "have"}, - {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"}, - {"LOWER": "to"}, - {"LOWER": "do"}, - {"POS": "ADP"}, - ] - ], - ), - ( - "3", - [ - [ - {"LEMMA": "have"}, - {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"}, - {"LOWER": "to"}, - {"LOWER": "do"}, - {"POS": "ADP"}, - ] - ], - ), -] - - -@pytest.fixture -def doc(en_tokenizer): - doc = en_tokenizer("also has to do with") - doc[0].tag_ = "RB" - doc[1].tag_ = "VBZ" - doc[2].tag_ = "TO" - doc[3].tag_ = "VB" - doc[4].tag_ = "IN" - return doc - - -@pytest.fixture -def matcher(en_tokenizer): - return Matcher(en_tokenizer.vocab) - - -@pytest.mark.parametrize("pattern", PATTERNS) -def test_issue3009(doc, matcher, pattern): - """Test problem with matcher quantifiers""" - matcher.add(pattern[0], None, *pattern[1]) - matches = matcher(doc) - assert matches - - -def test_issue2464(matcher): - """Test problem with successive ?. 
This is the same bug, so putting it here.""" - doc = Doc(matcher.vocab, words=["a", "b"]) - matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}]) - matches = matcher(doc) - assert len(matches) == 3 diff --git a/spacy/tests/regression/test_issue3012.py b/spacy/tests/regression/test_issue3012.py deleted file mode 100644 index 8fdc8b318..000000000 --- a/spacy/tests/regression/test_issue3012.py +++ /dev/null @@ -1,31 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...attrs import ENT_IOB, ENT_TYPE -from ...tokens import Doc -from ..util import get_doc - - -def test_issue3012(en_vocab): - """Test that the is_tagged attribute doesn't get overwritten when we from_array - without tag information.""" - words = ["This", "is", "10", "%", "."] - tags = ["DT", "VBZ", "CD", "NN", "."] - pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"] - ents = [(2, 4, "PERCENT")] - doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents) - assert doc.is_tagged - - expected = ("10", "NUM", "CD", "PERCENT") - assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected - - header = [ENT_IOB, ENT_TYPE] - ent_array = doc.to_array(header) - doc.from_array(header, ent_array) - - assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected - - # serializing then deserializing - doc_bytes = doc.to_bytes() - doc2 = Doc(en_vocab).from_bytes(doc_bytes) - assert (doc2[2].text, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_) == expected diff --git a/spacy/tests/regression/test_issue3199.py b/spacy/tests/regression/test_issue3199.py deleted file mode 100644 index d80a55330..000000000 --- a/spacy/tests/regression/test_issue3199.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.tokens import Doc -from spacy.vocab import Vocab - - -def test_issue3199(): - """Test that Span.noun_chunks works correctly if no noun chunks iterator - is available. To make this test future-proof, we're constructing a Doc - with a new Vocab here and setting is_parsed to make sure the noun chunks run. - """ - doc = Doc(Vocab(), words=["This", "is", "a", "sentence"]) - doc.is_parsed = True - assert list(doc[0:3].noun_chunks) == [] diff --git a/spacy/tests/regression/test_issue3209.py b/spacy/tests/regression/test_issue3209.py deleted file mode 100644 index 469e38b8c..000000000 --- a/spacy/tests/regression/test_issue3209.py +++ /dev/null @@ -1,23 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English - - -def test_issue3209(): - """Test issue that occurred in spaCy nightly where NER labels were being - mapped to classes incorrectly after loading the model, when the labels - were added using ner.add_label(). 
- """ - nlp = English() - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) - - ner.add_label("ANIMAL") - nlp.begin_training() - move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"] - assert ner.move_names == move_names - nlp2 = English() - nlp2.add_pipe(nlp2.create_pipe("ner")) - nlp2.from_bytes(nlp.to_bytes()) - assert nlp2.get_pipe("ner").move_names == move_names diff --git a/spacy/tests/regression/test_issue3248.py b/spacy/tests/regression/test_issue3248.py deleted file mode 100644 index c4b592f3c..000000000 --- a/spacy/tests/regression/test_issue3248.py +++ /dev/null @@ -1,27 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from spacy.matcher import PhraseMatcher -from spacy.lang.en import English -from spacy.compat import pickle - - -def test_issue3248_1(): - """Test that the PhraseMatcher correctly reports its number of rules, not - total number of patterns.""" - nlp = English() - matcher = PhraseMatcher(nlp.vocab) - matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) - matcher.add("TEST2", None, nlp("d")) - assert len(matcher) == 2 - - -def test_issue3248_2(): - """Test that the PhraseMatcher can be pickled correctly.""" - nlp = English() - matcher = PhraseMatcher(nlp.vocab) - matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) - matcher.add("TEST2", None, nlp("d")) - data = pickle.dumps(matcher) - new_matcher = pickle.loads(data) - assert len(new_matcher) == len(matcher) diff --git a/spacy/tests/regression/test_issue3277.py b/spacy/tests/regression/test_issue3277.py deleted file mode 100644 index 88ea67774..000000000 --- a/spacy/tests/regression/test_issue3277.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - - -def test_issue3277(es_tokenizer): - """Test that hyphens are split correctly as prefixes.""" - doc = es_tokenizer("—Yo me llamo... 
–murmuró el niño– Emilio Sánchez Pérez.") - assert len(doc) == 14 - assert doc[0].text == "\u2014" - assert doc[5].text == "\u2013" - assert doc[9].text == "\u2013" diff --git a/spacy/tests/regression/test_issue3288.py b/spacy/tests/regression/test_issue3288.py deleted file mode 100644 index 188bf361c..000000000 --- a/spacy/tests/regression/test_issue3288.py +++ /dev/null @@ -1,18 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import numpy -from spacy import displacy - -from ..util import get_doc - - -def test_issue3288(en_vocab): - """Test that retokenization works correctly via displaCy when punctuation - is merged onto the preceeding token and tensor is resized.""" - words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"] - heads = [1, 0, -1, 1, 0, 1, -2, -3] - deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"] - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) - doc.tensor = numpy.zeros((len(words), 96), dtype="float32") - displacy.render(doc) diff --git a/spacy/tests/regression/test_issue3289.py b/spacy/tests/regression/test_issue3289.py deleted file mode 100644 index 0e64f07ce..000000000 --- a/spacy/tests/regression/test_issue3289.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from spacy.lang.en import English - - -def test_issue3289(): - """Test that Language.to_bytes handles serializing a pipeline component - with an uninitialized model.""" - nlp = English() - nlp.add_pipe(nlp.create_pipe("textcat")) - bytes_data = nlp.to_bytes() - new_nlp = English() - new_nlp.add_pipe(nlp.create_pipe("textcat")) - new_nlp.from_bytes(bytes_data) diff --git a/spacy/tests/regression/test_issue3328.py b/spacy/tests/regression/test_issue3328.py deleted file mode 100644 index c397feebb..000000000 --- a/spacy/tests/regression/test_issue3328.py +++ /dev/null @@ -1,19 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from spacy.matcher import Matcher -from spacy.tokens import Doc - - -def test_issue3328(en_vocab): - doc = Doc(en_vocab, words=["Hello", ",", "how", "are", "you", "doing", "?"]) - matcher = Matcher(en_vocab) - patterns = [ - [{"LOWER": {"IN": ["hello", "how"]}}], - [{"LOWER": {"IN": ["you", "doing"]}}], - ] - matcher.add("TEST", None, *patterns) - matches = matcher(doc) - assert len(matches) == 4 - matched_texts = [doc[start:end].text for _, start, end in matches] - assert matched_texts == ["Hello", "how", "you", "doing"] diff --git a/spacy/tests/regression/test_issue3331.py b/spacy/tests/regression/test_issue3331.py deleted file mode 100644 index c30712f81..000000000 --- a/spacy/tests/regression/test_issue3331.py +++ /dev/null @@ -1,21 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest -from spacy.matcher import PhraseMatcher -from spacy.tokens import Doc - - -@pytest.mark.xfail -def test_issue3331(en_vocab): - """Test that duplicate patterns for different rules result in multiple - matches, one per rule. 
- """ - matcher = PhraseMatcher(en_vocab) - matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"])) - matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"])) - doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"]) - matches = matcher(doc) - assert len(matches) == 2 - match_ids = [en_vocab.strings[matches[0][0]], en_vocab.strings[matches[1][0]]] - assert sorted(match_ids) == ["A", "B"] diff --git a/spacy/tests/regression/test_issue3345.py b/spacy/tests/regression/test_issue3345.py deleted file mode 100644 index c358fd7bc..000000000 --- a/spacy/tests/regression/test_issue3345.py +++ /dev/null @@ -1,26 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.tokens import Doc -from spacy.pipeline import EntityRuler, EntityRecognizer - - -def test_issue3345(): - """Test case where preset entity crosses sentence boundary.""" - nlp = English() - doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"]) - doc[4].is_sent_start = True - ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}]) - ner = EntityRecognizer(doc.vocab) - # Add the OUT action. I wouldn't have thought this would be necessary... - ner.moves.add_action(5, "") - ner.add_label("GPE") - doc = ruler(doc) - # Get into the state just before "New" - state = ner.moves.init_batch([doc])[0] - ner.moves.apply_transition(state, "O") - ner.moves.apply_transition(state, "O") - ner.moves.apply_transition(state, "O") - # Check that B-GPE is valid. - assert ner.moves.is_valid(state, "B-GPE") diff --git a/spacy/tests/regression/test_issue3356.py b/spacy/tests/regression/test_issue3356.py deleted file mode 100644 index f8d16459c..000000000 --- a/spacy/tests/regression/test_issue3356.py +++ /dev/null @@ -1,72 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import re -from spacy import compat - -prefix_search = ( - b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])" - b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?" 
- b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}" - b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|" - b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|" - b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|" - b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|" - b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|" - b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|" - b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|" - b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|" - b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|" - b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|" - b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|" - b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F" - b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8" - b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17" - b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC" - b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940" - b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103" - b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125" - b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F" - b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4" - b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5" - b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B" - b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440" - b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2" - b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800" - b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76" - b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80" - b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004" - b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191" - b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250" - b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0" - b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77" - b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137" - b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E" - b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877" - b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45" - b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129" - b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C" - b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245" - b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A" - b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86" - b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0" - b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1" - b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6" - b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250" - b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400" - b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700" - b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810" - b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890" - 
b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940" - b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2" - b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF" - b"\\U0001FA60-\\U0001FA6D]" -) - - -if compat.is_python2: - # If we have this test in Python 3, pytest chokes, as it can't print the - # string above in the xpass message. - def test_issue3356(): - pattern = re.compile(compat.unescape_unicode(prefix_search.decode("utf8"))) - assert not pattern.search("hello") diff --git a/spacy/tests/regression/test_issue3410.py b/spacy/tests/regression/test_issue3410.py deleted file mode 100644 index 5d2ac5ba3..000000000 --- a/spacy/tests/regression/test_issue3410.py +++ /dev/null @@ -1,21 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.lang.en import English -from spacy.matcher import Matcher, PhraseMatcher - - -def test_issue3410(): - texts = ["Hello world", "This is a test"] - nlp = English() - matcher = Matcher(nlp.vocab) - phrasematcher = PhraseMatcher(nlp.vocab) - with pytest.deprecated_call(): - docs = list(nlp.pipe(texts, n_threads=4)) - with pytest.deprecated_call(): - docs = list(nlp.tokenizer.pipe(texts, n_threads=4)) - with pytest.deprecated_call(): - list(matcher.pipe(docs, n_threads=4)) - with pytest.deprecated_call(): - list(phrasematcher.pipe(docs, n_threads=4)) diff --git a/spacy/tests/regression/test_issue3447.py b/spacy/tests/regression/test_issue3447.py deleted file mode 100644 index 0ca1f9e67..000000000 --- a/spacy/tests/regression/test_issue3447.py +++ /dev/null @@ -1,14 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.util import decaying - - -def test_issue3447(): - sizes = decaying(10.0, 1.0, 0.5) - size = next(sizes) - assert size == 10.0 - size = next(sizes) - assert size == 10.0 - 0.5 - size = next(sizes) - assert size == 10.0 - 0.5 - 0.5 diff --git a/spacy/tests/regression/test_issue3449.py b/spacy/tests/regression/test_issue3449.py deleted file mode 100644 index deff49fd6..000000000 --- a/spacy/tests/regression/test_issue3449.py +++ /dev/null @@ -1,21 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - -from spacy.lang.en import English - - -@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot") -def test_issue3449(): - nlp = English() - nlp.add_pipe(nlp.create_pipe("sentencizer")) - text1 = "He gave the ball to I. Do you want to go to the movies with I?" - text2 = "He gave the ball to I. Do you want to go to the movies with I?" - text3 = "He gave the ball to I.\nDo you want to go to the movies with I?" 
- t1 = nlp(text1) - t2 = nlp(text2) - t3 = nlp(text3) - assert t1[5].text == "I" - assert t2[5].text == "I" - assert t3[5].text == "I" diff --git a/spacy/tests/regression/test_issue3468.py b/spacy/tests/regression/test_issue3468.py deleted file mode 100644 index ebbed2640..000000000 --- a/spacy/tests/regression/test_issue3468.py +++ /dev/null @@ -1,21 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.tokens import Doc - - -def test_issue3468(): - """Test that sentence boundaries are set correctly so Doc.is_sentenced can - be restored after serialization.""" - nlp = English() - nlp.add_pipe(nlp.create_pipe("sentencizer")) - doc = nlp("Hello world") - assert doc[0].is_sent_start - assert doc.is_sentenced - assert len(list(doc.sents)) == 1 - doc_bytes = doc.to_bytes() - new_doc = Doc(nlp.vocab).from_bytes(doc_bytes) - assert new_doc[0].is_sent_start - assert new_doc.is_sentenced - assert len(list(new_doc.sents)) == 1 diff --git a/spacy/tests/regression/test_issue3526.py b/spacy/tests/regression/test_issue3526.py index 3949c4b1c..c6f513730 100644 --- a/spacy/tests/regression/test_issue3526.py +++ b/spacy/tests/regression/test_issue3526.py @@ -7,6 +7,7 @@ from spacy.language import Language from spacy.pipeline import EntityRuler from spacy import load import srsly + from ..util import make_tempdir @@ -61,10 +62,9 @@ def test_entity_ruler_from_disk_old_format_safe(patterns, en_vocab): nlp = Language(vocab=en_vocab) ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) with make_tempdir() as tmpdir: - out_file = tmpdir / "entity_ruler.jsonl" - srsly.write_jsonl(out_file, ruler.patterns) - new_ruler = EntityRuler(nlp) - new_ruler = new_ruler.from_disk(out_file) + out_file = tmpdir / "entity_ruler" + srsly.write_jsonl(out_file.with_suffix(".jsonl"), ruler.patterns) + new_ruler = EntityRuler(nlp).from_disk(out_file) for pattern in ruler.patterns: assert pattern in new_ruler.patterns assert len(new_ruler) == len(ruler) @@ -79,8 +79,10 @@ def test_entity_ruler_in_pipeline_from_issue(patterns, en_vocab): nlp.add_pipe(ruler) with make_tempdir() as tmpdir: nlp.to_disk(tmpdir) - assert nlp.pipeline[-1][-1].patterns == [{"label": "ORG", "pattern": "Apple"}] - assert nlp.pipeline[-1][-1].overwrite is True + ruler = nlp.get_pipe("entity_ruler") + assert ruler.patterns == [{"label": "ORG", "pattern": "Apple"}] + assert ruler.overwrite is True nlp2 = load(tmpdir) - assert nlp2.pipeline[-1][-1].patterns == [{"label": "ORG", "pattern": "Apple"}] - assert nlp2.pipeline[-1][-1].overwrite is True + new_ruler = nlp2.get_pipe("entity_ruler") + assert new_ruler.patterns == [{"label": "ORG", "pattern": "Apple"}] + assert new_ruler.overwrite is True diff --git a/spacy/tests/regression/test_issue3611.py b/spacy/tests/regression/test_issue3611.py new file mode 100644 index 000000000..29aa5370d --- /dev/null +++ b/spacy/tests/regression/test_issue3611.py @@ -0,0 +1,51 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest +import spacy +from spacy.util import minibatch, compounding + + +def test_issue3611(): + """ Test whether adding n-grams in the textcat works even when n > token length of some docs """ + unique_classes = ["offensive", "inoffensive"] + x_train = ["This is an offensive text", + "This is the second offensive text", + "inoff"] + y_train = ["offensive", "offensive", "inoffensive"] + + # preparing the data + pos_cats = list() + for train_instance in y_train: + pos_cats.append({label: label == train_instance for 
label in unique_classes}) + train_data = list(zip(x_train, [{'cats': cats} for cats in pos_cats])) + + # set up the spacy model with a text categorizer component + nlp = spacy.blank('en') + + textcat = nlp.create_pipe( + "textcat", + config={ + "exclusive_classes": True, + "architecture": "bow", + "ngram_size": 2 + } + ) + + for label in unique_classes: + textcat.add_label(label) + nlp.add_pipe(textcat, last=True) + + # training the network + other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat'] + with nlp.disable_pipes(*other_pipes): + optimizer = nlp.begin_training() + for i in range(3): + losses = {} + batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001)) + + for batch in batches: + texts, annotations = zip(*batch) + nlp.update(docs=texts, golds=annotations, sgd=optimizer, drop=0.1, losses=losses) + + diff --git a/spacy/tests/regression/test_issue3625.py b/spacy/tests/regression/test_issue3625.py new file mode 100644 index 000000000..e3e0f25ee --- /dev/null +++ b/spacy/tests/regression/test_issue3625.py @@ -0,0 +1,10 @@ +# coding: utf8 +from __future__ import unicode_literals + +from spacy.lang.hi import Hindi + +def test_issue3625(): + """Test that default punctuation rules applies to hindi unicode characters""" + nlp = Hindi() + doc = nlp(u"hi. how हुए. होटल, होटल") + assert [token.text for token in doc] == ['hi', '.', 'how', 'हुए', '.', 'होटल', ',', 'होटल'] \ No newline at end of file diff --git a/spacy/tests/regression/test_issue3839.py b/spacy/tests/regression/test_issue3839.py index fa915faf0..34d6bb46e 100644 --- a/spacy/tests/regression/test_issue3839.py +++ b/spacy/tests/regression/test_issue3839.py @@ -6,7 +6,6 @@ from spacy.matcher import Matcher from spacy.tokens import Doc -@pytest.mark.xfail def test_issue3839(en_vocab): """Test that match IDs returned by the matcher are correct, are in the string """ doc = Doc(en_vocab, words=["terrific", "group", "of", "people"]) diff --git a/spacy/tests/regression/test_issue3869.py b/spacy/tests/regression/test_issue3869.py new file mode 100644 index 000000000..42584b133 --- /dev/null +++ b/spacy/tests/regression/test_issue3869.py @@ -0,0 +1,31 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest + +from spacy.attrs import IS_ALPHA +from spacy.lang.en import English + + +@pytest.mark.parametrize( + "sentence", + [ + 'The story was to the effect that a young American student recently called on Professor Christlieb with a letter of introduction.', + 'The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale\'s #1.', + 'The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale\'s number one', + 'Indeed, making the one who remains do all the work has installed him into a position of such insolent tyranny, it will take a month at least to reduce him to his proper proportions.', + "It was a missed assignment, but it shouldn't have resulted in a turnover ..." 
+ ], +) +def test_issue3869(sentence): + """Test that the Doc's count_by function works consistently""" + nlp = English() + doc = nlp(sentence) + + count = 0 + for token in doc: + count += token.is_alpha + + assert count == doc.count_by(IS_ALPHA).get(1, 0) + + diff --git a/spacy/tests/regression/test_issue3880.py b/spacy/tests/regression/test_issue3880.py new file mode 100644 index 000000000..6de373f11 --- /dev/null +++ b/spacy/tests/regression/test_issue3880.py @@ -0,0 +1,22 @@ +# coding: utf8 +from __future__ import unicode_literals + +from spacy.lang.en import English + + +def test_issue3880(): + """Test that `nlp.pipe()` works when an empty string ends the batch. + + Fixed in v7.0.5 of Thinc. + """ + texts = ["hello", "world", "", ""] + nlp = English() + nlp.add_pipe(nlp.create_pipe("parser")) + nlp.add_pipe(nlp.create_pipe("ner")) + nlp.add_pipe(nlp.create_pipe("tagger")) + nlp.get_pipe("parser").add_label("dep") + nlp.get_pipe("ner").add_label("PERSON") + nlp.get_pipe("tagger").add_label("NN") + nlp.begin_training() + for doc in nlp.pipe(texts): + pass diff --git a/spacy/tests/serialize/test_serialize_kb.py b/spacy/tests/serialize/test_serialize_kb.py new file mode 100644 index 000000000..fa7253fa1 --- /dev/null +++ b/spacy/tests/serialize/test_serialize_kb.py @@ -0,0 +1,74 @@ +# coding: utf-8 +from __future__ import unicode_literals + +from ..util import make_tempdir +from ...util import ensure_path + +from spacy.kb import KnowledgeBase + + +def test_serialize_kb_disk(en_vocab): + # baseline assertions + kb1 = _get_dummy_kb(en_vocab) + _check_kb(kb1) + + # dumping to file & loading back in + with make_tempdir() as d: + dir_path = ensure_path(d) + if not dir_path.exists(): + dir_path.mkdir() + file_path = dir_path / "kb" + kb1.dump(str(file_path)) + + kb2 = KnowledgeBase(vocab=en_vocab, entity_vector_length=3) + kb2.load_bulk(str(file_path)) + + # final assertions + _check_kb(kb2) + + +def _get_dummy_kb(vocab): + kb = KnowledgeBase(vocab=vocab, entity_vector_length=3) + + kb.add_entity(entity='Q53', prob=0.33, entity_vector=[0, 5, 3]) + kb.add_entity(entity='Q17', prob=0.2, entity_vector=[7, 1, 0]) + kb.add_entity(entity='Q007', prob=0.7, entity_vector=[0, 0, 7]) + kb.add_entity(entity='Q44', prob=0.4, entity_vector=[4, 4, 4]) + + kb.add_alias(alias='double07', entities=['Q17', 'Q007'], probabilities=[0.1, 0.9]) + kb.add_alias(alias='guy', entities=['Q53', 'Q007', 'Q17', 'Q44'], probabilities=[0.3, 0.3, 0.2, 0.1]) + kb.add_alias(alias='random', entities=['Q007'], probabilities=[1.0]) + + return kb + + +def _check_kb(kb): + # check entities + assert kb.get_size_entities() == 4 + for entity_string in ['Q53', 'Q17', 'Q007', 'Q44']: + assert entity_string in kb.get_entity_strings() + for entity_string in ['', 'Q0']: + assert entity_string not in kb.get_entity_strings() + + # check aliases + assert kb.get_size_aliases() == 3 + for alias_string in ['double07', 'guy', 'random']: + assert alias_string in kb.get_alias_strings() + for alias_string in ['nothingness', '', 'randomnoise']: + assert alias_string not in kb.get_alias_strings() + + # check candidates & probabilities + candidates = sorted(kb.get_candidates('double07'), key=lambda x: x.entity_) + assert len(candidates) == 2 + + assert candidates[0].entity_ == 'Q007' + assert 0.6999 < candidates[0].entity_freq < 0.701 + assert candidates[0].entity_vector == [0, 0, 7] + assert candidates[0].alias_ == 'double07' + assert 0.899 < candidates[0].prior_prob < 0.901 + + assert candidates[1].entity_ == 'Q17' + assert 0.199 < 
candidates[1].entity_freq < 0.201 + assert candidates[1].entity_vector == [7, 1, 0] + assert candidates[1].alias_ == 'double07' + assert 0.099 < candidates[1].prior_prob < 0.101 diff --git a/spacy/tokens/_serialize.py b/spacy/tokens/_serialize.py index 43ea78242..41f524839 100644 --- a/spacy/tokens/_serialize.py +++ b/spacy/tokens/_serialize.py @@ -11,29 +11,27 @@ from ..tokens import Doc from ..attrs import SPACY, ORTH -class Binder(object): +class DocBox(object): """Serialize analyses from a collection of doc objects.""" - def __init__(self, attrs=None): - """Create a Binder object, to hold serialized annotations. + def __init__(self, attrs=None, store_user_data=False): + """Create a DocBox object, to hold serialized annotations. attrs (list): List of attributes to serialize. 'orth' and 'spacy' are always serialized, so they're not required. Defaults to None. """ attrs = attrs or [] - self.attrs = list(attrs) # Ensure ORTH is always attrs[0] - if ORTH in self.attrs: - self.attrs.pop(ORTH) - if SPACY in self.attrs: - self.attrs.pop(SPACY) + self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY] self.attrs.insert(0, ORTH) self.tokens = [] self.spaces = [] + self.user_data = [] self.strings = set() + self.store_user_data = store_user_data def add(self, doc): - """Add a doc's annotations to the binder for serialization.""" + """Add a doc's annotations to the DocBox for serialization.""" array = doc.to_array(self.attrs) if len(array.shape) == 1: array = array.reshape((array.shape[0], 1)) @@ -43,27 +41,35 @@ class Binder(object): spaces = spaces.reshape((spaces.shape[0], 1)) self.spaces.append(numpy.asarray(spaces, dtype=bool)) self.strings.update(w.text for w in doc) + if self.store_user_data: + self.user_data.append(srsly.msgpack_dumps(doc.user_data)) def get_docs(self, vocab): """Recover Doc objects from the annotations, using the given vocab.""" for string in self.strings: vocab[string] orth_col = self.attrs.index(ORTH) - for tokens, spaces in zip(self.tokens, self.spaces): + for i in range(len(self.tokens)): + tokens = self.tokens[i] + spaces = self.spaces[i] words = [vocab.strings[orth] for orth in tokens[:, orth_col]] doc = Doc(vocab, words=words, spaces=spaces) doc = doc.from_array(self.attrs, tokens) + if self.store_user_data: + doc.user_data.update(srsly.msgpack_loads(self.user_data[i])) yield doc def merge(self, other): - """Extend the annotations of this binder with the annotations from another.""" + """Extend the annotations of this DocBox with the annotations from another.""" assert self.attrs == other.attrs self.tokens.extend(other.tokens) self.spaces.extend(other.spaces) self.strings.update(other.strings) + if self.store_user_data: + self.user_data.extend(other.user_data) def to_bytes(self): - """Serialize the binder's annotations into a byte string.""" + """Serialize the DocBox's annotations into a byte string.""" for tokens in self.tokens: assert len(tokens.shape) == 2, tokens.shape lengths = [len(tokens) for tokens in self.tokens] @@ -74,10 +80,12 @@ class Binder(object): "lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"), "strings": list(self.strings), } + if self.store_user_data: + msg["user_data"] = self.user_data return gzip.compress(srsly.msgpack_dumps(msg)) def from_bytes(self, string): - """Deserialize the binder's annotations from a byte string.""" + """Deserialize the DocBox's annotations from a byte string.""" msg = srsly.msgpack_loads(gzip.decompress(string)) self.attrs = msg["attrs"] self.strings = set(msg["strings"]) @@ -89,29 +97,38 
@@ class Binder(object): flat_spaces = flat_spaces.reshape((flat_spaces.size, 1)) self.tokens = NumpyOps().unflatten(flat_tokens, lengths) self.spaces = NumpyOps().unflatten(flat_spaces, lengths) + if self.store_user_data and "user_data" in msg: + self.user_data = list(msg["user_data"]) for tokens in self.tokens: assert len(tokens.shape) == 2, tokens.shape return self -def merge_bytes(binder_strings): - """Concatenate multiple serialized binders into one byte string.""" - output = None - for byte_string in binder_strings: - binder = Binder().from_bytes(byte_string) - if output is None: - output = binder - else: - output.merge(binder) - return output.to_bytes() +def merge_boxes(boxes): + merged = None + for byte_string in boxes: + if byte_string is not None: + box = DocBox(store_user_data=True).from_bytes(byte_string) + if merged is None: + merged = box + else: + merged.merge(box) + if merged is not None: + return merged.to_bytes() + else: + return b"" -def pickle_binder(binder): - return (unpickle_binder, (binder.to_bytes(),)) +def pickle_box(box): + return (unpickle_box, (box.to_bytes(),)) -def unpickle_binder(byte_string): - return Binder().from_bytes(byte_string) +def unpickle_box(byte_string): + return DocBox().from_bytes(byte_string) -copy_reg.pickle(Binder, pickle_binder, unpickle_binder) +copy_reg.pickle(DocBox, pickle_box, unpickle_box) +# Compatibility, as we had named it this previously. +Binder = DocBox + +__all__ = ["DocBox"] diff --git a/spacy/tokens/doc.pxd b/spacy/tokens/doc.pxd index 7cdc2316a..62665fcc5 100644 --- a/spacy/tokens/doc.pxd +++ b/spacy/tokens/doc.pxd @@ -1,6 +1,5 @@ from cymem.cymem cimport Pool cimport numpy as np -from preshed.counter cimport PreshCounter from ..vocab cimport Vocab from ..structs cimport TokenC, LexemeC diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index 131c43d37..c1883f9c0 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -9,6 +9,7 @@ cimport cython cimport numpy as np from libc.string cimport memcpy, memset from libc.math cimport sqrt +from collections import Counter import numpy import numpy.linalg @@ -22,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME from ..typedefs cimport attr_t, flags_t from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB -from ..attrs cimport ENT_TYPE, SENT_START, attr_id_t +from ..attrs cimport ENT_TYPE, ENT_KB_ID, SENT_START, attr_id_t from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t from ..attrs import intify_attrs, IDS @@ -64,6 +65,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil: return token.ent_iob elif feat_name == ENT_TYPE: return token.ent_type + elif feat_name == ENT_KB_ID: + return token.ent_kb_id else: return Lexeme.get_struct_attr(token.lex, feat_name) @@ -85,13 +88,14 @@ cdef class Doc: Python-level `Token` and `Span` objects are views of this array, i.e. they don't own the data themselves. 
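For context on the `DocBox` changes above, here is a minimal usage sketch that is not taken from the patch: it serializes a couple of docs with `store_user_data=True` and restores them. The import path `spacy.tokens._serialize`, the example texts and the `"source"` key are assumptions based on this diff (the old `Binder` name is kept as an alias), so treat it as illustrative only.

```python
# Illustrative sketch of the DocBox helper added in spacy/tokens/_serialize.py.
# store_user_data=True is needed on both ends for user_data to round-trip.
from spacy.attrs import LEMMA, POS
from spacy.lang.en import English
from spacy.tokens._serialize import DocBox

nlp = English()
box = DocBox(attrs=[LEMMA, POS], store_user_data=True)
for doc in nlp.pipe([u"Hello world", u"This is a test"]):
    doc.user_data["source"] = u"example"  # arbitrary user data to carry along
    box.add(doc)

data = box.to_bytes()  # gzip-compressed msgpack, per the patch
restored = DocBox(store_user_data=True).from_bytes(data)
docs = list(restored.get_docs(nlp.vocab))
assert docs[0].user_data["source"] == u"example"
```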
- EXAMPLE: Construction 1 + EXAMPLE: + Construction 1 >>> doc = nlp(u'Some text') Construction 2 >>> from spacy.tokens import Doc >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'], - spaces=[True, False, False]) + >>> spaces=[True, False, False]) DOCS: https://spacy.io/api/doc """ @@ -237,6 +241,8 @@ cdef class Doc: return True if self.is_parsed: return True + if len(self) < 2: + return True for i in range(1, self.length): if self.c[i].sent_start == -1 or self.c[i].sent_start == 1: return True @@ -248,6 +254,8 @@ cdef class Doc: *any* of the tokens has a named entity tag set (even if the others are uknown values). """ + if len(self) == 0: + return True for i in range(self.length): if self.c[i].ent_iob != 0: return True @@ -690,7 +698,7 @@ cdef class Doc: # Handle 1d case return output if len(attr_ids) >= 2 else output.reshape((self.length,)) - def count_by(self, attr_id_t attr_id, exclude=None, PreshCounter counts=None): + def count_by(self, attr_id_t attr_id, exclude=None, object counts=None): """Count the frequencies of a given attribute. Produces a dict of `{attribute (int): count (ints)}` frequencies, keyed by the values of the given attribute ID. @@ -705,19 +713,18 @@ cdef class Doc: cdef size_t count if counts is None: - counts = PreshCounter() + counts = Counter() output_dict = True else: output_dict = False # Take this check out of the loop, for a bit of extra speed if exclude is None: for i in range(self.length): - counts.inc(get_token_attr(&self.c[i], attr_id), 1) + counts[get_token_attr(&self.c[i], attr_id)] += 1 else: for i in range(self.length): if not exclude(self[i]): - attr = get_token_attr(&self.c[i], attr_id) - counts.inc(attr, 1) + counts[get_token_attr(&self.c[i], attr_id)] += 1 if output_dict: return dict(counts) @@ -850,7 +857,7 @@ cdef class Doc: DOCS: https://spacy.io/api/doc#to_bytes """ - array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE] + array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE] # TODO: ENT_KB_ID ? if self.is_tagged: array_head.append(TAG) # If doc parsed add head and dep attribute @@ -1004,6 +1011,7 @@ cdef class Doc: """ cdef unicode tag, lemma, ent_type deprecation_warning(Warnings.W013.format(obj="Doc")) + # TODO: ENT_KB_ID ? 
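The `count_by` change above swaps `PreshCounter` for a plain `collections.Counter`, but the documented return value stays a dict keyed by the integer attribute value. A small sketch of that behaviour (example text made up):

```python
from spacy.attrs import ORTH
from spacy.lang.en import English

nlp = English()
doc = nlp(u"apple apple orange")
counts = doc.count_by(ORTH)  # {attribute value (here: the string hash): count}
assert counts[nlp.vocab.strings[u"apple"]] == 2
assert counts[nlp.vocab.strings[u"orange"]] == 1
```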
if len(args) == 3: deprecation_warning(Warnings.W003) tag, lemma, ent_type = args diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx index 97b6a1adc..3f4f4418b 100644 --- a/spacy/tokens/span.pyx +++ b/spacy/tokens/span.pyx @@ -210,7 +210,7 @@ cdef class Span: words = [t.text for t in self] spaces = [bool(t.whitespace_) for t in self] cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces) - array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE] + array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_KB_ID] if self.doc.is_tagged: array_head.append(TAG) # If doc parsed add head and dep attribute diff --git a/spacy/tokens/token.pxd b/spacy/tokens/token.pxd index bb9f7d070..ec5df3fac 100644 --- a/spacy/tokens/token.pxd +++ b/spacy/tokens/token.pxd @@ -53,6 +53,8 @@ cdef class Token: return token.ent_iob elif feat_name == ENT_TYPE: return token.ent_type + elif feat_name == ENT_KB_ID: + return token.ent_kb_id elif feat_name == SENT_START: return token.sent_start else: @@ -79,5 +81,7 @@ cdef class Token: token.ent_iob = value elif feat_name == ENT_TYPE: token.ent_type = value + elif feat_name == ENT_KB_ID: + token.ent_kb_id = value elif feat_name == SENT_START: token.sent_start = value diff --git a/website/docs/api/annotation.md b/website/docs/api/annotation.md index a5bb30b6f..ed0e0b3e0 100644 --- a/website/docs/api/annotation.md +++ b/website/docs/api/annotation.md @@ -520,7 +520,9 @@ spaCy takes training data in JSON format. The built-in [`convert`](/api/cli#convert) command helps you convert the `.conllu` format used by the [Universal Dependencies corpora](https://github.com/UniversalDependencies) to -spaCy's training format. +spaCy's training format. To convert one or more existing `Doc` objects to +spaCy's JSON format, you can use the +[`gold.docs_to_json`](/api/goldparse#docs_to_json) helper. > #### Annotating entities > diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index a69e62219..7af134e40 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -284,9 +284,9 @@ same between pretraining and training. The API and errors around this need some improvement. ```bash -$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] -[--depth] [--embed-rows] [--loss_func] [--dropout] [--seed] [--n-iter] [--use-vectors] -[--n-save_every] +$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] +[--width] [--depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] [--min-length] +[--seed] [--n-iter] [--use-vectors] [--n-save_every] [--init-tok2vec] [--epoch-start] ``` | Argument | Type | Description | @@ -306,7 +306,8 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] | `--n-iter`, `-i` | option | Number of iterations to pretrain. | | `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. | | `--n-save-every`, `-se` | option | Save model every X batches. | -| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental.| +| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental.| +| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. 
Prevents unintended overwriting of existing weight files.| | **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. | ### JSONL format for raw text {#pretrain-jsonl} diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md index f5a94335f..bf9801564 100644 --- a/website/docs/api/doc.md +++ b/website/docs/api/doc.md @@ -264,7 +264,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor. | ----------- | -------------------------------------- | ----------------------------------------------- | | **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. | -## Doc.to_json {#to_json, tag="method" new="2.1"} +## Doc.to_json {#to_json tag="method" new="2.1"} Convert a Doc to JSON. The format it produces will be the new format for the [`spacy train`](/api/cli#train) command (not implemented yet). If custom diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md index dcbf99da5..5c05450f8 100644 --- a/website/docs/api/entityruler.md +++ b/website/docs/api/entityruler.md @@ -30,14 +30,14 @@ be a token pattern (list) or a phrase pattern (string). For example: > ruler = EntityRuler(nlp, overwrite_ents=True) > ``` -| Name | Type | Description | -| ---------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. | -| `patterns` | iterable | Optional patterns to load in. | -| `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phtasematcher). defaults to `None` -| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. | -| `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. | -| **RETURNS** | `EntityRuler` | The newly constructed object. | +| Name | Type | Description | +| --------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | +| `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. | +| `patterns` | iterable | Optional patterns to load in. | +| `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phtasematcher). defaults to `None` | +| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. | +| `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. | +| **RETURNS** | `EntityRuler` | The newly constructed object. | ## EntityRuler.\_\len\_\_ {#len tag="method"} @@ -123,35 +123,41 @@ of dicts) or a phrase pattern (string). For more details, see the usage guide on ## EntityRuler.to_disk {#to_disk tag="method"} Save the entity ruler patterns to a directory. The patterns will be saved as -newline-delimited JSON (JSONL). +newline-delimited JSON (JSONL). If a file with the suffix `.jsonl` is provided, +only the patterns are saved as JSONL. 
If a directory name is provided, a +`patterns.jsonl` and `cfg` file with the component configuration is exported. > #### Example > > ```python > ruler = EntityRuler(nlp) -> ruler.to_disk("/path/to/rules.jsonl") +> ruler.to_disk("/path/to/patterns.jsonl") # saves patterns only +> ruler.to_disk("/path/to/entity_ruler") # saves patterns and config > ``` -| Name | Type | Description | -| ------ | ---------------- | ---------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | +| Name | Type | Description | +| ------ | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------- | +| `path` | unicode / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | ## EntityRuler.from_disk {#from_disk tag="method"} -Load the entity ruler from a file. Expects a file containing newline-delimited -JSON (JSONL) with one entry per line. +Load the entity ruler from a file. Expects either a file containing +newline-delimited JSON (JSONL) with one entry per line, or a directory +containing a `patterns.jsonl` file and a `cfg` file with the component +configuration. > #### Example > > ```python > ruler = EntityRuler(nlp) -> ruler.from_disk("/path/to/rules.jsonl") +> ruler.from_disk("/path/to/patterns.jsonl") # loads patterns only +> ruler.from_disk("/path/to/entity_ruler") # loads patterns and config > ``` -| Name | Type | Description | -| ----------- | ---------------- | --------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a JSONL file. Paths may be either strings or `Path`-like objects. | -| **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. | +| Name | Type | Description | +| ----------- | ---------------- | ---------------------------------------------------------------------------------------- | +| `path` | unicode / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. | +| **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. | ## EntityRuler.to_bytes {#to_bytes tag="method"} diff --git a/website/docs/api/goldparse.md b/website/docs/api/goldparse.md index ca5b6a811..13f68a85d 100644 --- a/website/docs/api/goldparse.md +++ b/website/docs/api/goldparse.md @@ -55,6 +55,27 @@ Whether the provided syntactic annotations form a projective dependency tree. ## Utilities {#util} +### gold.docs_to_json {#docs_to_json tag="function"} + +Convert a list of Doc objects into the +[JSON-serializable format](/api/annotation#json-input) used by the +[`spacy train`](/api/cli#train) command. + +> #### Example +> +> ```python +> from spacy.gold import docs_to_json +> +> doc = nlp(u"I like London") +> json_data = docs_to_json([doc]) +> ``` + +| Name | Type | Description | +| ----------- | ---------------- | ------------------------------------------ | +| `docs` | iterable / `Doc` | The `Doc` object(s) to convert. | +| `id` | int | ID to assign to the JSON. Defaults to `0`. | +| **RETURNS** | list | The data in spaCy's JSON format. 
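As a follow-up to the `gold.docs_to_json` example above, a hedged sketch of writing the converted data out with `srsly` for use with `spacy train`. The file name is made up, and depending on the spaCy version the helper may return a list or a single dict, so check the shape before writing.

```python
import srsly
from spacy.gold import docs_to_json
from spacy.lang.en import English

nlp = English()
docs = [nlp(u"I like London"), nlp(u"Berlin is nice too")]
json_data = docs_to_json(docs)
# spacy train expects a list of document dicts at the top level; wrap json_data
# in a list here if your version returns a single dict instead of a list.
srsly.write_json("train.json", json_data)
```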
| + ### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"} Encode labelled spans into per-token tags, using the diff --git a/website/docs/api/scorer.md b/website/docs/api/scorer.md index e6a8595fd..2af4ec0ce 100644 --- a/website/docs/api/scorer.md +++ b/website/docs/api/scorer.md @@ -46,13 +46,14 @@ Update the evaluation scores from a single [`Doc`](/api/doc) / ## Properties -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------------------------------------------------------- | -| `token_acc` | float | Tokenization accuracy. | -| `tags_acc` | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`). | -| `uas` | float | Unlabelled dependency score. | -| `las` | float | Labelled dependency score. | -| `ents_p` | float | Named entity accuracy (precision). | -| `ents_r` | float | Named entity accuracy (recall). | -| `ents_f` | float | Named entity accuracy (F-score). | -| `scores` | dict | All scores with keys `uas`, `las`, `ents_p`, `ents_r`, `ents_f`, `tags_acc` and `token_acc`. | +| Name | Type | Description | +| ---------------------------------------------- | ----- | ------------------------------------------------------------------------------------------------------------- | +| `token_acc` | float | Tokenization accuracy. | +| `tags_acc` | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`). | +| `uas` | float | Unlabelled dependency score. | +| `las` | float | Labelled dependency score. | +| `ents_p` | float | Named entity accuracy (precision). | +| `ents_r` | float | Named entity accuracy (recall). | +| `ents_f` | float | Named entity accuracy (F-score). | +| `ents_per_type` <Tag variant="new">2.1.5</Tag> | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. | +| `scores` | dict | All scores with keys `uas`, `las`, `ents_p`, `ents_r`, `ents_f`, `ents_per_type`, `tags_acc` and `token_acc`. | diff --git a/website/docs/api/tokenizer.md b/website/docs/api/tokenizer.md index 5bc0df625..67e67f5c9 100644 --- a/website/docs/api/tokenizer.md +++ b/website/docs/api/tokenizer.md @@ -9,7 +9,10 @@ Segment text, and create `Doc` objects with the discovered segment boundaries. ## Tokenizer.\_\_init\_\_ {#init tag="method"} -Create a `Tokenizer`, to create `Doc` objects given unicode text. +Create a `Tokenizer`, to create `Doc` objects given unicode text. For examples +of how to construct a custom tokenizer with different tokenization rules, see +the +[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers). > #### Example > @@ -18,11 +21,14 @@ Create a `Tokenizer`, to create `Doc` objects given unicode text. 
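To illustrate the new `ents_per_type` property documented in the `scorer.md` change above, a rough sketch using a hand-built prediction and gold standard, assuming the v2.1-style `Scorer.score(doc, gold)` call; the text, entity and printed scores are only examples.

```python
from spacy.gold import GoldParse
from spacy.lang.en import English
from spacy.scorer import Scorer
from spacy.tokens import Span

nlp = English()
doc = nlp(u"Apple is looking at U.K. startups")
doc.ents = [Span(doc, 0, 1, label=u"ORG")]  # pretend this is the model's prediction
gold = GoldParse(doc, entities=[(0, 5, u"ORG")])
scorer = Scorer()
scorer.score(doc, gold)
print(scorer.ents_per_type)  # e.g. {"ORG": {"p": 100.0, "r": 100.0, "f": 100.0}}
```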
> from spacy.tokenizer import Tokenizer > from spacy.lang.en import English > nlp = English() +> # Create a blank Tokenizer with just the English vocab > tokenizer = Tokenizer(nlp.vocab) > > # Construction 2 > from spacy.lang.en import English > nlp = English() +> # Create a Tokenizer with the default settings for English +> # including punctuation rules and exceptions > tokenizer = nlp.Defaults.create_tokenizer(nlp) > ``` diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 91513588c..b84bf4e12 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -39,6 +39,9 @@ mkdir models python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json ``` +You can also use the [`gold.docs_to_json`](/api/goldparse#docs_to_json) helper +to convert a list of `Doc` objects to spaCy's JSON training format. + #### Understanding the training output When you train a model using the [`spacy train`](/api/cli#train) command, you'll @@ -630,13 +633,13 @@ should be somewhat larger, especially if your documents are long. ### Learning rate, regularization and gradient clipping {#tips-hyperparams} -By default spaCy uses the Adam solver, with default settings (`learn_rate=0.001`, -`beta1=0.9`, `beta2=0.999`). Some researchers have said they found -these settings terrible on their problems – but they've always performed very -well in training spaCy's models, in combination with the rest of our recipe. You -can change these settings directly, by modifying the corresponding attributes on -the `optimizer` object. You can also set environment variables, to adjust the -defaults. +By default spaCy uses the Adam solver, with default settings +(`learn_rate=0.001`, `beta1=0.9`, `beta2=0.999`). Some researchers have said +they found these settings terrible on their problems – but they've always +performed very well in training spaCy's models, in combination with the rest of +our recipe. You can change these settings directly, by modifying the +corresponding attributes on the `optimizer` object. You can also set environment +variables, to adjust the defaults. There are two other key hyper-parameters of the solver: `L2` **regularization**, and **gradient clipping** (`max_grad_norm`). 
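The training docs above describe changing the Adam solver settings by modifying attributes on the `optimizer` object. A heavily hedged sketch follows; the attribute names are assumptions based on Thinc's optimizer and may differ between versions, so inspect the object before relying on them.

```python
from spacy.lang.en import English

nlp = English()
optimizer = nlp.begin_training()
# Assumed attribute names (check your Thinc version before relying on them):
optimizer.L2 = 1e-6            # L2 regularization strength
optimizer.max_grad_norm = 1.0  # gradient clipping threshold
```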
Gradient clipping is a hack that's diff --git a/website/meta/languages.json b/website/meta/languages.json index cfa468d7f..ef336ef5f 100644 --- a/website/meta/languages.json +++ b/website/meta/languages.json @@ -104,6 +104,7 @@ { "code": "ga", "name": "Irish" }, { "code": "bn", "name": "Bengali", "has_examples": true }, { "code": "hi", "name": "Hindi", "example": "यह एक वाक्य है।", "has_examples": true }, + { "code": "mr", "name": "Marathi" }, { "code": "kn", "name": "Kannada" }, { "code": "ta", "name": "Tamil", "has_examples": true }, { @@ -153,6 +154,20 @@ "example": "これは文章です。", "has_examples": true }, + { + "code": "ko", + "name": "Korean", + "dependencies": [ + { + "name": "mecab-ko", + "url": "https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md" + }, + { "name": "mecab-ko-dic", "url": "https://bitbucket.org/eunjeon/mecab-ko-dic" }, + { "name": "natto-py", "url": "https://github.com/buruzaemon/natto-py" } + ], + "example": "이것은 문장입니다.", + "has_examples": true + }, { "code": "vi", "name": "Vietnamese", diff --git a/website/src/widgets/landing.js b/website/src/widgets/landing.js index f55aa5aa3..e9dec87f4 100644 --- a/website/src/widgets/landing.js +++ b/website/src/widgets/landing.js @@ -152,20 +152,21 @@ const Landing = ({ data }) => { <LandingBannerGrid> <LandingBanner title="spaCy IRL 2019: Two days of NLP" - label="Join us in Berlin" - to="https://irl.spacy.io/2019" - button="Get tickets" + label="Watch the videos" + to="https://www.youtube.com/playlist?list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc" + button="Watch the videos" background="#ffc194" backgroundImage={irlBackground} color="#1a1e23" small > - We're pleased to invite the spaCy community and other folks working on Natural + We were pleased to invite the spaCy community and other folks working on Natural Language Processing to Berlin this summer for a small and intimate event{' '} - <strong>July 5-6, 2019</strong>. The event includes a hands-on training day for - teams using spaCy in production, followed by a one-track conference. We've - booked a beautiful venue, hand-picked an awesome lineup of speakers and - scheduled plenty of social time to get to know each other and exchange ideas. + <strong>July 6, 2019</strong>. We booked a beautiful venue, hand-picked an + awesome lineup of speakers and scheduled plenty of social time to get to know + each other and exchange ideas. The YouTube playlist includes 12 talks about NLP + research, development and applications, with keynotes by Sebastian Ruder + (DeepMind) and Yoav Goldberg (Allen AI). </LandingBanner> <LandingBanner