mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 18:26:30 +03:00
Feature toggle_pipes (#5378)
* make disable_pipes deprecated in favour of the new toggle_pipes * rewrite disable_pipes statements * update documentation * remove bin/wiki_entity_linking folder * one more fix * remove deprecated link to documentation * few more doc fixes * add note about name change to the docs * restore original disable_pipes * small fixes * fix typo * fix error number to W096 * rename to select_pipes * also make changes to the documentation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
This commit is contained in:
parent
333b1a308b
commit
0d94737857
|
@ -1,37 +0,0 @@
|
||||||
## Entity Linking with Wikipedia and Wikidata
|
|
||||||
|
|
||||||
### Step 1: Create a Knowledge Base (KB) and training data
|
|
||||||
|
|
||||||
Run `wikidata_pretrain_kb.py`
|
|
||||||
* This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
|
|
||||||
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
|
|
||||||
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
|
|
||||||
* You can set the filtering parameters for KB construction:
|
|
||||||
* `max_per_alias` (`-a`): (max) number of candidate entities in the KB per alias/synonym
|
|
||||||
* `min_freq` (`-f`): threshold of number of times an entity should occur in the corpus to be included in the KB
|
|
||||||
* `min_pair` (`-c`): threshold of number of times an entity+alias combination should occur in the corpus to be included in the KB
|
|
||||||
* Further parameters to set:
|
|
||||||
* `descriptions_from_wikipedia` (`-wp`): whether to parse descriptions from Wikipedia (`True`) or Wikidata (`False`)
|
|
||||||
* `entity_vector_length` (`-v`): length of the pre-trained entity description vectors
|
|
||||||
* `lang` (`-la`): language for which to fetch Wikidata information (as the dump contains all languages)
|
|
||||||
|
|
||||||
Quick testing and rerunning:
|
|
||||||
* When trying out the pipeline for a quick test, set `limit_prior` (`-lp`), `limit_train` (`-lt`) and/or `limit_wd` (`-lw`) to read only parts of the dumps instead of everything.
|
|
||||||
* e.g. set `-lt 20000 -lp 2000 -lw 3000 -f 1`
|
|
||||||
* If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
|
|
||||||
|
|
||||||
|
|
||||||
### Step 2: Train an Entity Linking model
|
|
||||||
|
|
||||||
Run `wikidata_train_entity_linker.py`
|
|
||||||
* This takes the **KB directory** produced by Step 1, and trains an **Entity Linking model**
|
|
||||||
* Specify the output directory (`-o`) in which the final, trained model will be saved
|
|
||||||
* You can set the learning parameters for the EL training:
|
|
||||||
* `epochs` (`-e`): number of training iterations
|
|
||||||
* `dropout` (`-p`): dropout rate
|
|
||||||
* `lr` (`-n`): learning rate
|
|
||||||
* `l2` (`-r`): L2 regularization
|
|
||||||
* Specify the number of training and dev testing articles with `train_articles` (`-t`) and `dev_articles` (`-d`) respectively
|
|
||||||
* If not specified, the full dataset will be processed - this may take a LONG time !
|
|
||||||
* Further parameters to set:
|
|
||||||
* `labels_discard` (`-l`): NER label types to discard during training
|
|
|
@ -1,12 +0,0 @@
|
||||||
TRAINING_DATA_FILE = "gold_entities.jsonl"
|
|
||||||
KB_FILE = "kb"
|
|
||||||
KB_MODEL_DIR = "nlp_kb"
|
|
||||||
OUTPUT_MODEL_DIR = "nlp"
|
|
||||||
|
|
||||||
PRIOR_PROB_PATH = "prior_prob.csv"
|
|
||||||
ENTITY_DEFS_PATH = "entity_defs.csv"
|
|
||||||
ENTITY_FREQ_PATH = "entity_freq.csv"
|
|
||||||
ENTITY_ALIAS_PATH = "entity_alias.csv"
|
|
||||||
ENTITY_DESCR_PATH = "entity_descriptions.csv"
|
|
||||||
|
|
||||||
LOG_FORMAT = '%(asctime)s - %(levelname)s - %(name)s - %(message)s'
|
|
|
@ -1,204 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import logging
|
|
||||||
import random
|
|
||||||
from tqdm import tqdm
|
|
||||||
from collections import defaultdict
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
|
|
||||||
class Metrics(object):
|
|
||||||
true_pos = 0
|
|
||||||
false_pos = 0
|
|
||||||
false_neg = 0
|
|
||||||
|
|
||||||
def update_results(self, true_entity, candidate):
|
|
||||||
candidate_is_correct = true_entity == candidate
|
|
||||||
|
|
||||||
# Assume that we have no labeled negatives in the data (i.e. cases where true_entity is "NIL")
|
|
||||||
# Therefore, if candidate_is_correct then we have a true positive and never a true negative.
|
|
||||||
self.true_pos += candidate_is_correct
|
|
||||||
self.false_neg += not candidate_is_correct
|
|
||||||
if candidate and candidate not in {"", "NIL"}:
|
|
||||||
# A wrong prediction (e.g. Q42 != Q3) counts both as a FP as well as a FN.
|
|
||||||
self.false_pos += not candidate_is_correct
|
|
||||||
|
|
||||||
def calculate_precision(self):
|
|
||||||
if self.true_pos == 0:
|
|
||||||
return 0.0
|
|
||||||
else:
|
|
||||||
return self.true_pos / (self.true_pos + self.false_pos)
|
|
||||||
|
|
||||||
def calculate_recall(self):
|
|
||||||
if self.true_pos == 0:
|
|
||||||
return 0.0
|
|
||||||
else:
|
|
||||||
return self.true_pos / (self.true_pos + self.false_neg)
|
|
||||||
|
|
||||||
def calculate_fscore(self):
|
|
||||||
p = self.calculate_precision()
|
|
||||||
r = self.calculate_recall()
|
|
||||||
if p + r == 0:
|
|
||||||
return 0.0
|
|
||||||
else:
|
|
||||||
return 2 * p * r / (p + r)
|
|
||||||
|
|
||||||
|
|
||||||
class EvaluationResults(object):
|
|
||||||
def __init__(self):
|
|
||||||
self.metrics = Metrics()
|
|
||||||
self.metrics_by_label = defaultdict(Metrics)
|
|
||||||
|
|
||||||
def update_metrics(self, ent_label, true_entity, candidate):
|
|
||||||
self.metrics.update_results(true_entity, candidate)
|
|
||||||
self.metrics_by_label[ent_label].update_results(true_entity, candidate)
|
|
||||||
|
|
||||||
def report_metrics(self, model_name):
|
|
||||||
model_str = model_name.title()
|
|
||||||
recall = self.metrics.calculate_recall()
|
|
||||||
precision = self.metrics.calculate_precision()
|
|
||||||
fscore = self.metrics.calculate_fscore()
|
|
||||||
return (
|
|
||||||
"{}: ".format(model_str)
|
|
||||||
+ "F-score = {} | ".format(round(fscore, 3))
|
|
||||||
+ "Recall = {} | ".format(round(recall, 3))
|
|
||||||
+ "Precision = {} | ".format(round(precision, 3))
|
|
||||||
+ "F-score by label = {}".format(
|
|
||||||
{k: v.calculate_fscore() for k, v in sorted(self.metrics_by_label.items())}
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
class BaselineResults(object):
|
|
||||||
def __init__(self):
|
|
||||||
self.random = EvaluationResults()
|
|
||||||
self.prior = EvaluationResults()
|
|
||||||
self.oracle = EvaluationResults()
|
|
||||||
|
|
||||||
def report_performance(self, model):
|
|
||||||
results = getattr(self, model)
|
|
||||||
return results.report_metrics(model)
|
|
||||||
|
|
||||||
def update_baselines(
|
|
||||||
self,
|
|
||||||
true_entity,
|
|
||||||
ent_label,
|
|
||||||
random_candidate,
|
|
||||||
prior_candidate,
|
|
||||||
oracle_candidate,
|
|
||||||
):
|
|
||||||
self.oracle.update_metrics(ent_label, true_entity, oracle_candidate)
|
|
||||||
self.prior.update_metrics(ent_label, true_entity, prior_candidate)
|
|
||||||
self.random.update_metrics(ent_label, true_entity, random_candidate)
|
|
||||||
|
|
||||||
|
|
||||||
def measure_performance(dev_data, kb, el_pipe, baseline=True, context=True, dev_limit=None):
|
|
||||||
counts = dict()
|
|
||||||
baseline_results = BaselineResults()
|
|
||||||
context_results = EvaluationResults()
|
|
||||||
combo_results = EvaluationResults()
|
|
||||||
|
|
||||||
for doc, gold in tqdm(dev_data, total=dev_limit, leave=False, desc='Processing dev data'):
|
|
||||||
if len(doc) > 0:
|
|
||||||
correct_ents = dict()
|
|
||||||
for entity, kb_dict in gold.links.items():
|
|
||||||
start, end = entity
|
|
||||||
for gold_kb, value in kb_dict.items():
|
|
||||||
if value:
|
|
||||||
# only evaluating on positive examples
|
|
||||||
offset = _offset(start, end)
|
|
||||||
correct_ents[offset] = gold_kb
|
|
||||||
|
|
||||||
if baseline:
|
|
||||||
_add_baseline(baseline_results, counts, doc, correct_ents, kb)
|
|
||||||
|
|
||||||
if context:
|
|
||||||
# using only context
|
|
||||||
el_pipe.cfg["incl_context"] = True
|
|
||||||
el_pipe.cfg["incl_prior"] = False
|
|
||||||
_add_eval_result(context_results, doc, correct_ents, el_pipe)
|
|
||||||
|
|
||||||
# measuring combined accuracy (prior + context)
|
|
||||||
el_pipe.cfg["incl_context"] = True
|
|
||||||
el_pipe.cfg["incl_prior"] = True
|
|
||||||
_add_eval_result(combo_results, doc, correct_ents, el_pipe)
|
|
||||||
|
|
||||||
if baseline:
|
|
||||||
logger.info("Counts: {}".format({k: v for k, v in sorted(counts.items())}))
|
|
||||||
logger.info(baseline_results.report_performance("random"))
|
|
||||||
logger.info(baseline_results.report_performance("prior"))
|
|
||||||
logger.info(baseline_results.report_performance("oracle"))
|
|
||||||
|
|
||||||
if context:
|
|
||||||
logger.info(context_results.report_metrics("context only"))
|
|
||||||
logger.info(combo_results.report_metrics("context and prior"))
|
|
||||||
|
|
||||||
|
|
||||||
def _add_eval_result(results, doc, correct_ents, el_pipe):
|
|
||||||
"""
|
|
||||||
Evaluate the ent.kb_id_ annotations against the gold standard.
|
|
||||||
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
doc = el_pipe(doc)
|
|
||||||
for ent in doc.ents:
|
|
||||||
ent_label = ent.label_
|
|
||||||
start = ent.start_char
|
|
||||||
end = ent.end_char
|
|
||||||
offset = _offset(start, end)
|
|
||||||
gold_entity = correct_ents.get(offset, None)
|
|
||||||
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
|
|
||||||
if gold_entity is not None:
|
|
||||||
pred_entity = ent.kb_id_
|
|
||||||
results.update_metrics(ent_label, gold_entity, pred_entity)
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logging.error("Error assessing accuracy " + str(e))
|
|
||||||
|
|
||||||
|
|
||||||
def _add_baseline(baseline_results, counts, doc, correct_ents, kb):
|
|
||||||
"""
|
|
||||||
Measure 3 performance baselines: random selection, prior probabilities, and 'oracle' prediction for upper bound.
|
|
||||||
Only evaluate entities that overlap between gold and NER, to isolate the performance of the NEL.
|
|
||||||
"""
|
|
||||||
for ent in doc.ents:
|
|
||||||
ent_label = ent.label_
|
|
||||||
start = ent.start_char
|
|
||||||
end = ent.end_char
|
|
||||||
offset = _offset(start, end)
|
|
||||||
gold_entity = correct_ents.get(offset, None)
|
|
||||||
|
|
||||||
# the gold annotations are not complete so we can't evaluate missing annotations as 'wrong'
|
|
||||||
if gold_entity is not None:
|
|
||||||
candidates = kb.get_candidates(ent.text)
|
|
||||||
oracle_candidate = ""
|
|
||||||
prior_candidate = ""
|
|
||||||
random_candidate = ""
|
|
||||||
if candidates:
|
|
||||||
scores = []
|
|
||||||
|
|
||||||
for c in candidates:
|
|
||||||
scores.append(c.prior_prob)
|
|
||||||
if c.entity_ == gold_entity:
|
|
||||||
oracle_candidate = c.entity_
|
|
||||||
|
|
||||||
best_index = scores.index(max(scores))
|
|
||||||
prior_candidate = candidates[best_index].entity_
|
|
||||||
random_candidate = random.choice(candidates).entity_
|
|
||||||
|
|
||||||
current_count = counts.get(ent_label, 0)
|
|
||||||
counts[ent_label] = current_count+1
|
|
||||||
|
|
||||||
baseline_results.update_baselines(
|
|
||||||
gold_entity,
|
|
||||||
ent_label,
|
|
||||||
random_candidate,
|
|
||||||
prior_candidate,
|
|
||||||
oracle_candidate,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def _offset(start, end):
|
|
||||||
return "{}_{}".format(start, end)
|
|
|
@ -1,161 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import logging
|
|
||||||
|
|
||||||
from spacy.kb import KnowledgeBase
|
|
||||||
|
|
||||||
from bin.wiki_entity_linking.train_descriptions import EntityEncoder
|
|
||||||
from bin.wiki_entity_linking import wiki_io as io
|
|
||||||
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
|
|
||||||
def create_kb(
|
|
||||||
nlp,
|
|
||||||
max_entities_per_alias,
|
|
||||||
min_entity_freq,
|
|
||||||
min_occ,
|
|
||||||
entity_def_path,
|
|
||||||
entity_descr_path,
|
|
||||||
entity_alias_path,
|
|
||||||
entity_freq_path,
|
|
||||||
prior_prob_path,
|
|
||||||
entity_vector_length,
|
|
||||||
):
|
|
||||||
# Create the knowledge base from Wikidata entries
|
|
||||||
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=entity_vector_length)
|
|
||||||
entity_list, filtered_title_to_id = _define_entities(nlp, kb, entity_def_path, entity_descr_path, min_entity_freq, entity_freq_path, entity_vector_length)
|
|
||||||
_define_aliases(kb, entity_alias_path, entity_list, filtered_title_to_id, max_entities_per_alias, min_occ, prior_prob_path)
|
|
||||||
return kb
|
|
||||||
|
|
||||||
|
|
||||||
def _define_entities(nlp, kb, entity_def_path, entity_descr_path, min_entity_freq, entity_freq_path, entity_vector_length):
|
|
||||||
# read the mappings from file
|
|
||||||
title_to_id = io.read_title_to_id(entity_def_path)
|
|
||||||
id_to_descr = io.read_id_to_descr(entity_descr_path)
|
|
||||||
|
|
||||||
# check the length of the nlp vectors
|
|
||||||
if "vectors" in nlp.meta and nlp.vocab.vectors.size:
|
|
||||||
input_dim = nlp.vocab.vectors_length
|
|
||||||
logger.info("Loaded pretrained vectors of size %s" % input_dim)
|
|
||||||
else:
|
|
||||||
raise ValueError(
|
|
||||||
"The `nlp` object should have access to pretrained word vectors, "
|
|
||||||
" cf. https://spacy.io/usage/models#languages."
|
|
||||||
)
|
|
||||||
|
|
||||||
logger.info("Filtering entities with fewer than {} mentions or no description".format(min_entity_freq))
|
|
||||||
entity_frequencies = io.read_entity_to_count(entity_freq_path)
|
|
||||||
# filter the entities for in the KB by frequency, because there's just too much data (8M entities) otherwise
|
|
||||||
filtered_title_to_id, entity_list, description_list, frequency_list = get_filtered_entities(
|
|
||||||
title_to_id,
|
|
||||||
id_to_descr,
|
|
||||||
entity_frequencies,
|
|
||||||
min_entity_freq
|
|
||||||
)
|
|
||||||
logger.info("Kept {} entities from the set of {}".format(len(description_list), len(title_to_id.keys())))
|
|
||||||
|
|
||||||
logger.info("Training entity encoder")
|
|
||||||
encoder = EntityEncoder(nlp, input_dim, entity_vector_length)
|
|
||||||
encoder.train(description_list=description_list, to_print=True)
|
|
||||||
|
|
||||||
logger.info("Getting entity embeddings")
|
|
||||||
embeddings = encoder.apply_encoder(description_list)
|
|
||||||
|
|
||||||
logger.info("Adding {} entities".format(len(entity_list)))
|
|
||||||
kb.set_entities(
|
|
||||||
entity_list=entity_list, freq_list=frequency_list, vector_list=embeddings
|
|
||||||
)
|
|
||||||
return entity_list, filtered_title_to_id
|
|
||||||
|
|
||||||
|
|
||||||
def _define_aliases(kb, entity_alias_path, entity_list, filtered_title_to_id, max_entities_per_alias, min_occ, prior_prob_path):
|
|
||||||
logger.info("Adding aliases from Wikipedia and Wikidata")
|
|
||||||
_add_aliases(
|
|
||||||
kb,
|
|
||||||
entity_list=entity_list,
|
|
||||||
title_to_id=filtered_title_to_id,
|
|
||||||
max_entities_per_alias=max_entities_per_alias,
|
|
||||||
min_occ=min_occ,
|
|
||||||
prior_prob_path=prior_prob_path,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def get_filtered_entities(title_to_id, id_to_descr, entity_frequencies,
|
|
||||||
min_entity_freq: int = 10):
|
|
||||||
filtered_title_to_id = dict()
|
|
||||||
entity_list = []
|
|
||||||
description_list = []
|
|
||||||
frequency_list = []
|
|
||||||
for title, entity in title_to_id.items():
|
|
||||||
freq = entity_frequencies.get(title, 0)
|
|
||||||
desc = id_to_descr.get(entity, None)
|
|
||||||
if desc and freq > min_entity_freq:
|
|
||||||
entity_list.append(entity)
|
|
||||||
description_list.append(desc)
|
|
||||||
frequency_list.append(freq)
|
|
||||||
filtered_title_to_id[title] = entity
|
|
||||||
return filtered_title_to_id, entity_list, description_list, frequency_list
|
|
||||||
|
|
||||||
|
|
||||||
def _add_aliases(kb, entity_list, title_to_id, max_entities_per_alias, min_occ, prior_prob_path):
|
|
||||||
wp_titles = title_to_id.keys()
|
|
||||||
|
|
||||||
# adding aliases with prior probabilities
|
|
||||||
# we can read this file sequentially, it's sorted by alias, and then by count
|
|
||||||
logger.info("Adding WP aliases")
|
|
||||||
with prior_prob_path.open("r", encoding="utf8") as prior_file:
|
|
||||||
# skip header
|
|
||||||
prior_file.readline()
|
|
||||||
line = prior_file.readline()
|
|
||||||
previous_alias = None
|
|
||||||
total_count = 0
|
|
||||||
counts = []
|
|
||||||
entities = []
|
|
||||||
while line:
|
|
||||||
splits = line.replace("\n", "").split(sep="|")
|
|
||||||
new_alias = splits[0]
|
|
||||||
count = int(splits[1])
|
|
||||||
entity = splits[2]
|
|
||||||
|
|
||||||
if new_alias != previous_alias and previous_alias:
|
|
||||||
# done reading the previous alias --> output
|
|
||||||
if len(entities) > 0:
|
|
||||||
selected_entities = []
|
|
||||||
prior_probs = []
|
|
||||||
for ent_count, ent_string in zip(counts, entities):
|
|
||||||
if ent_string in wp_titles:
|
|
||||||
wd_id = title_to_id[ent_string]
|
|
||||||
p_entity_givenalias = ent_count / total_count
|
|
||||||
selected_entities.append(wd_id)
|
|
||||||
prior_probs.append(p_entity_givenalias)
|
|
||||||
|
|
||||||
if selected_entities:
|
|
||||||
try:
|
|
||||||
kb.add_alias(
|
|
||||||
alias=previous_alias,
|
|
||||||
entities=selected_entities,
|
|
||||||
probabilities=prior_probs,
|
|
||||||
)
|
|
||||||
except ValueError as e:
|
|
||||||
logger.error(e)
|
|
||||||
total_count = 0
|
|
||||||
counts = []
|
|
||||||
entities = []
|
|
||||||
|
|
||||||
total_count += count
|
|
||||||
|
|
||||||
if len(entities) < max_entities_per_alias and count >= min_occ:
|
|
||||||
counts.append(count)
|
|
||||||
entities.append(entity)
|
|
||||||
previous_alias = new_alias
|
|
||||||
|
|
||||||
line = prior_file.readline()
|
|
||||||
|
|
||||||
|
|
||||||
def read_kb(nlp, kb_file):
|
|
||||||
kb = KnowledgeBase(vocab=nlp.vocab)
|
|
||||||
kb.load_bulk(kb_file)
|
|
||||||
return kb
|
|
|
@ -1,145 +0,0 @@
|
||||||
from random import shuffle
|
|
||||||
|
|
||||||
import logging
|
|
||||||
import numpy as np
|
|
||||||
|
|
||||||
from thinc.api import Model, chain, CosineDistance, Linear
|
|
||||||
|
|
||||||
from spacy.util import create_default_optimizer
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
|
|
||||||
class EntityEncoder:
|
|
||||||
"""
|
|
||||||
Train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D).
|
|
||||||
This entity vector will be stored in the KB, for further downstream use in the entity model.
|
|
||||||
"""
|
|
||||||
|
|
||||||
DROP = 0
|
|
||||||
BATCH_SIZE = 1000
|
|
||||||
|
|
||||||
# Set min. acceptable loss to avoid a 'mean of empty slice' warning by numpy
|
|
||||||
MIN_LOSS = 0.01
|
|
||||||
|
|
||||||
# Reasonable default to stop training when things are not improving
|
|
||||||
MAX_NO_IMPROVEMENT = 20
|
|
||||||
|
|
||||||
def __init__(self, nlp, input_dim, desc_width, epochs=5):
|
|
||||||
self.nlp = nlp
|
|
||||||
self.input_dim = input_dim
|
|
||||||
self.desc_width = desc_width
|
|
||||||
self.epochs = epochs
|
|
||||||
self.distance = CosineDistance(ignore_zeros=True, normalize=False)
|
|
||||||
|
|
||||||
def apply_encoder(self, description_list):
|
|
||||||
if self.encoder is None:
|
|
||||||
raise ValueError("Can not apply encoder before training it")
|
|
||||||
|
|
||||||
batch_size = 100000
|
|
||||||
|
|
||||||
start = 0
|
|
||||||
stop = min(batch_size, len(description_list))
|
|
||||||
encodings = []
|
|
||||||
|
|
||||||
while start < len(description_list):
|
|
||||||
docs = list(self.nlp.pipe(description_list[start:stop]))
|
|
||||||
doc_embeddings = [self._get_doc_embedding(doc) for doc in docs]
|
|
||||||
enc = self.encoder(np.asarray(doc_embeddings))
|
|
||||||
encodings.extend(enc.tolist())
|
|
||||||
|
|
||||||
start = start + batch_size
|
|
||||||
stop = min(stop + batch_size, len(description_list))
|
|
||||||
logger.info("Encoded: {} entities".format(stop))
|
|
||||||
|
|
||||||
return encodings
|
|
||||||
|
|
||||||
def train(self, description_list, to_print=False):
|
|
||||||
processed, loss = self._train_model(description_list)
|
|
||||||
if to_print:
|
|
||||||
logger.info(
|
|
||||||
"Trained entity descriptions on {} ".format(processed) +
|
|
||||||
"(non-unique) descriptions across {} ".format(self.epochs) +
|
|
||||||
"epochs"
|
|
||||||
)
|
|
||||||
logger.info("Final loss: {}".format(loss))
|
|
||||||
|
|
||||||
def _train_model(self, description_list):
|
|
||||||
best_loss = 1.0
|
|
||||||
iter_since_best = 0
|
|
||||||
self._build_network(self.input_dim, self.desc_width)
|
|
||||||
|
|
||||||
processed = 0
|
|
||||||
loss = 1
|
|
||||||
# copy this list so that shuffling does not affect other functions
|
|
||||||
descriptions = description_list.copy()
|
|
||||||
to_continue = True
|
|
||||||
|
|
||||||
for i in range(self.epochs):
|
|
||||||
shuffle(descriptions)
|
|
||||||
|
|
||||||
batch_nr = 0
|
|
||||||
start = 0
|
|
||||||
stop = min(self.BATCH_SIZE, len(descriptions))
|
|
||||||
|
|
||||||
while to_continue and start < len(descriptions):
|
|
||||||
batch = []
|
|
||||||
for descr in descriptions[start:stop]:
|
|
||||||
doc = self.nlp(descr)
|
|
||||||
doc_vector = self._get_doc_embedding(doc)
|
|
||||||
batch.append(doc_vector)
|
|
||||||
|
|
||||||
loss = self._update(batch)
|
|
||||||
if batch_nr % 25 == 0:
|
|
||||||
logger.info("loss: {} ".format(loss))
|
|
||||||
processed += len(batch)
|
|
||||||
|
|
||||||
# in general, continue training if we haven't reached our ideal min yet
|
|
||||||
to_continue = loss > self.MIN_LOSS
|
|
||||||
|
|
||||||
# store the best loss and track how long it's been
|
|
||||||
if loss < best_loss:
|
|
||||||
best_loss = loss
|
|
||||||
iter_since_best = 0
|
|
||||||
else:
|
|
||||||
iter_since_best += 1
|
|
||||||
|
|
||||||
# stop learning if we haven't seen improvement since the last few iterations
|
|
||||||
if iter_since_best > self.MAX_NO_IMPROVEMENT:
|
|
||||||
to_continue = False
|
|
||||||
|
|
||||||
batch_nr += 1
|
|
||||||
start = start + self.BATCH_SIZE
|
|
||||||
stop = min(stop + self.BATCH_SIZE, len(descriptions))
|
|
||||||
|
|
||||||
return processed, loss
|
|
||||||
|
|
||||||
@staticmethod
|
|
||||||
def _get_doc_embedding(doc):
|
|
||||||
indices = np.zeros((len(doc),), dtype="i")
|
|
||||||
for i, word in enumerate(doc):
|
|
||||||
if word.orth in doc.vocab.vectors.key2row:
|
|
||||||
indices[i] = doc.vocab.vectors.key2row[word.orth]
|
|
||||||
else:
|
|
||||||
indices[i] = 0
|
|
||||||
word_vectors = doc.vocab.vectors.data[indices]
|
|
||||||
doc_vector = np.mean(word_vectors, axis=0)
|
|
||||||
return doc_vector
|
|
||||||
|
|
||||||
def _build_network(self, orig_width, hidden_with):
|
|
||||||
with Model.define_operators({">>": chain}):
|
|
||||||
# very simple encoder-decoder model
|
|
||||||
self.encoder = Linear(hidden_with, orig_width)
|
|
||||||
# TODO: removed the zero_init here - is oK?
|
|
||||||
self.model = self.encoder >> Linear(orig_width, hidden_with)
|
|
||||||
self.sgd = create_default_optimizer()
|
|
||||||
|
|
||||||
def _update(self, vectors):
|
|
||||||
truths = self.model.ops.asarray(vectors)
|
|
||||||
predictions, bp_model = self.model.begin_update(
|
|
||||||
truths, drop=self.DROP
|
|
||||||
)
|
|
||||||
d_scores, loss = self.distance(predictions, truths)
|
|
||||||
bp_model(d_scores, sgd=self.sgd)
|
|
||||||
return loss / len(vectors)
|
|
||||||
|
|
|
@ -1,127 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import csv
|
|
||||||
|
|
||||||
# min() needed to prevent error on windows, cf https://stackoverflow.com/questions/52404416/
|
|
||||||
csv.field_size_limit(min(sys.maxsize, 2147483646))
|
|
||||||
|
|
||||||
""" This class provides reading/writing methods for temp files """
|
|
||||||
|
|
||||||
|
|
||||||
# Entity definition: WP title -> WD ID #
|
|
||||||
def write_title_to_id(entity_def_output, title_to_id):
|
|
||||||
with entity_def_output.open("w", encoding="utf8") as id_file:
|
|
||||||
id_file.write("WP_title" + "|" + "WD_id" + "\n")
|
|
||||||
for title, qid in title_to_id.items():
|
|
||||||
id_file.write(title + "|" + str(qid) + "\n")
|
|
||||||
|
|
||||||
|
|
||||||
def read_title_to_id(entity_def_output):
|
|
||||||
title_to_id = dict()
|
|
||||||
with entity_def_output.open("r", encoding="utf8") as id_file:
|
|
||||||
csvreader = csv.reader(id_file, delimiter="|")
|
|
||||||
# skip header
|
|
||||||
next(csvreader)
|
|
||||||
for row in csvreader:
|
|
||||||
title_to_id[row[0]] = row[1]
|
|
||||||
return title_to_id
|
|
||||||
|
|
||||||
|
|
||||||
# Entity aliases from WD: WD ID -> WD alias #
|
|
||||||
def write_id_to_alias(entity_alias_path, id_to_alias):
|
|
||||||
with entity_alias_path.open("w", encoding="utf8") as alias_file:
|
|
||||||
alias_file.write("WD_id" + "|" + "alias" + "\n")
|
|
||||||
for qid, alias_list in id_to_alias.items():
|
|
||||||
for alias in alias_list:
|
|
||||||
alias_file.write(str(qid) + "|" + alias + "\n")
|
|
||||||
|
|
||||||
|
|
||||||
def read_id_to_alias(entity_alias_path):
|
|
||||||
id_to_alias = dict()
|
|
||||||
with entity_alias_path.open("r", encoding="utf8") as alias_file:
|
|
||||||
csvreader = csv.reader(alias_file, delimiter="|")
|
|
||||||
# skip header
|
|
||||||
next(csvreader)
|
|
||||||
for row in csvreader:
|
|
||||||
qid = row[0]
|
|
||||||
alias = row[1]
|
|
||||||
alias_list = id_to_alias.get(qid, [])
|
|
||||||
alias_list.append(alias)
|
|
||||||
id_to_alias[qid] = alias_list
|
|
||||||
return id_to_alias
|
|
||||||
|
|
||||||
|
|
||||||
def read_alias_to_id_generator(entity_alias_path):
|
|
||||||
""" Read (aliases, qid) tuples """
|
|
||||||
|
|
||||||
with entity_alias_path.open("r", encoding="utf8") as alias_file:
|
|
||||||
csvreader = csv.reader(alias_file, delimiter="|")
|
|
||||||
# skip header
|
|
||||||
next(csvreader)
|
|
||||||
for row in csvreader:
|
|
||||||
qid = row[0]
|
|
||||||
alias = row[1]
|
|
||||||
yield alias, qid
|
|
||||||
|
|
||||||
|
|
||||||
# Entity descriptions from WD: WD ID -> WD alias #
|
|
||||||
def write_id_to_descr(entity_descr_output, id_to_descr):
|
|
||||||
with entity_descr_output.open("w", encoding="utf8") as descr_file:
|
|
||||||
descr_file.write("WD_id" + "|" + "description" + "\n")
|
|
||||||
for qid, descr in id_to_descr.items():
|
|
||||||
descr_file.write(str(qid) + "|" + descr + "\n")
|
|
||||||
|
|
||||||
|
|
||||||
def read_id_to_descr(entity_desc_path):
|
|
||||||
id_to_desc = dict()
|
|
||||||
with entity_desc_path.open("r", encoding="utf8") as descr_file:
|
|
||||||
csvreader = csv.reader(descr_file, delimiter="|")
|
|
||||||
# skip header
|
|
||||||
next(csvreader)
|
|
||||||
for row in csvreader:
|
|
||||||
id_to_desc[row[0]] = row[1]
|
|
||||||
return id_to_desc
|
|
||||||
|
|
||||||
|
|
||||||
# Entity counts from WP: WP title -> count #
|
|
||||||
def write_entity_to_count(prior_prob_input, count_output):
|
|
||||||
# Write entity counts for quick access later
|
|
||||||
entity_to_count = dict()
|
|
||||||
total_count = 0
|
|
||||||
|
|
||||||
with prior_prob_input.open("r", encoding="utf8") as prior_file:
|
|
||||||
# skip header
|
|
||||||
prior_file.readline()
|
|
||||||
line = prior_file.readline()
|
|
||||||
|
|
||||||
while line:
|
|
||||||
splits = line.replace("\n", "").split(sep="|")
|
|
||||||
# alias = splits[0]
|
|
||||||
count = int(splits[1])
|
|
||||||
entity = splits[2]
|
|
||||||
|
|
||||||
current_count = entity_to_count.get(entity, 0)
|
|
||||||
entity_to_count[entity] = current_count + count
|
|
||||||
|
|
||||||
total_count += count
|
|
||||||
|
|
||||||
line = prior_file.readline()
|
|
||||||
|
|
||||||
with count_output.open("w", encoding="utf8") as entity_file:
|
|
||||||
entity_file.write("entity" + "|" + "count" + "\n")
|
|
||||||
for entity, count in entity_to_count.items():
|
|
||||||
entity_file.write(entity + "|" + str(count) + "\n")
|
|
||||||
|
|
||||||
|
|
||||||
def read_entity_to_count(count_input):
|
|
||||||
entity_to_count = dict()
|
|
||||||
with count_input.open("r", encoding="utf8") as csvfile:
|
|
||||||
csvreader = csv.reader(csvfile, delimiter="|")
|
|
||||||
# skip header
|
|
||||||
next(csvreader)
|
|
||||||
for row in csvreader:
|
|
||||||
entity_to_count[row[0]] = int(row[1])
|
|
||||||
|
|
||||||
return entity_to_count
|
|
|
@ -1,128 +0,0 @@
|
||||||
# coding: utf8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
# List of meta pages in Wikidata, should be kept out of the Knowledge base
|
|
||||||
WD_META_ITEMS = [
|
|
||||||
"Q163875",
|
|
||||||
"Q191780",
|
|
||||||
"Q224414",
|
|
||||||
"Q4167836",
|
|
||||||
"Q4167410",
|
|
||||||
"Q4663903",
|
|
||||||
"Q11266439",
|
|
||||||
"Q13406463",
|
|
||||||
"Q15407973",
|
|
||||||
"Q18616576",
|
|
||||||
"Q19887878",
|
|
||||||
"Q22808320",
|
|
||||||
"Q23894233",
|
|
||||||
"Q33120876",
|
|
||||||
"Q42104522",
|
|
||||||
"Q47460393",
|
|
||||||
"Q64875536",
|
|
||||||
"Q66480449",
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
# TODO: add more cases from non-English WP's
|
|
||||||
|
|
||||||
# List of prefixes that refer to Wikipedia "file" pages
|
|
||||||
WP_FILE_NAMESPACE = ["Bestand", "File"]
|
|
||||||
|
|
||||||
# List of prefixes that refer to Wikipedia "category" pages
|
|
||||||
WP_CATEGORY_NAMESPACE = ["Kategori", "Category", "Categorie"]
|
|
||||||
|
|
||||||
# List of prefixes that refer to Wikipedia "meta" pages
|
|
||||||
# these will/should be matched ignoring case
|
|
||||||
WP_META_NAMESPACE = (
|
|
||||||
WP_FILE_NAMESPACE
|
|
||||||
+ WP_CATEGORY_NAMESPACE
|
|
||||||
+ [
|
|
||||||
"b",
|
|
||||||
"betawikiversity",
|
|
||||||
"Book",
|
|
||||||
"c",
|
|
||||||
"Commons",
|
|
||||||
"d",
|
|
||||||
"dbdump",
|
|
||||||
"download",
|
|
||||||
"Draft",
|
|
||||||
"Education",
|
|
||||||
"Foundation",
|
|
||||||
"Gadget",
|
|
||||||
"Gadget definition",
|
|
||||||
"Gebruiker",
|
|
||||||
"gerrit",
|
|
||||||
"Help",
|
|
||||||
"Image",
|
|
||||||
"Incubator",
|
|
||||||
"m",
|
|
||||||
"mail",
|
|
||||||
"mailarchive",
|
|
||||||
"media",
|
|
||||||
"MediaWiki",
|
|
||||||
"MediaWiki talk",
|
|
||||||
"Mediawikiwiki",
|
|
||||||
"MediaZilla",
|
|
||||||
"Meta",
|
|
||||||
"Metawikipedia",
|
|
||||||
"Module",
|
|
||||||
"mw",
|
|
||||||
"n",
|
|
||||||
"nost",
|
|
||||||
"oldwikisource",
|
|
||||||
"otrs",
|
|
||||||
"OTRSwiki",
|
|
||||||
"Overleg gebruiker",
|
|
||||||
"outreach",
|
|
||||||
"outreachwiki",
|
|
||||||
"Portal",
|
|
||||||
"phab",
|
|
||||||
"Phabricator",
|
|
||||||
"Project",
|
|
||||||
"q",
|
|
||||||
"quality",
|
|
||||||
"rev",
|
|
||||||
"s",
|
|
||||||
"spcom",
|
|
||||||
"Special",
|
|
||||||
"species",
|
|
||||||
"Strategy",
|
|
||||||
"sulutil",
|
|
||||||
"svn",
|
|
||||||
"Talk",
|
|
||||||
"Template",
|
|
||||||
"Template talk",
|
|
||||||
"Testwiki",
|
|
||||||
"ticket",
|
|
||||||
"TimedText",
|
|
||||||
"Toollabs",
|
|
||||||
"tools",
|
|
||||||
"tswiki",
|
|
||||||
"User",
|
|
||||||
"User talk",
|
|
||||||
"v",
|
|
||||||
"voy",
|
|
||||||
"w",
|
|
||||||
"Wikibooks",
|
|
||||||
"Wikidata",
|
|
||||||
"wikiHow",
|
|
||||||
"Wikinvest",
|
|
||||||
"wikilivres",
|
|
||||||
"Wikimedia",
|
|
||||||
"Wikinews",
|
|
||||||
"Wikipedia",
|
|
||||||
"Wikipedia talk",
|
|
||||||
"Wikiquote",
|
|
||||||
"Wikisource",
|
|
||||||
"Wikispecies",
|
|
||||||
"Wikitech",
|
|
||||||
"Wikiversity",
|
|
||||||
"Wikivoyage",
|
|
||||||
"wikt",
|
|
||||||
"wiktionary",
|
|
||||||
"wmf",
|
|
||||||
"wmania",
|
|
||||||
"WP",
|
|
||||||
]
|
|
||||||
)
|
|
|
@ -1,179 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
"""Script to process Wikipedia and Wikidata dumps and create a knowledge base (KB)
|
|
||||||
with specific parameters. Intermediate files are written to disk.
|
|
||||||
|
|
||||||
Running the full pipeline on a standard laptop, may take up to 13 hours of processing.
|
|
||||||
Use the -p, -d and -s options to speed up processing using the intermediate files
|
|
||||||
from a previous run.
|
|
||||||
|
|
||||||
For the Wikidata dump: get the latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
|
|
||||||
For the Wikipedia dump: get enwiki-latest-pages-articles-multistream.xml.bz2
|
|
||||||
from https://dumps.wikimedia.org/enwiki/latest/
|
|
||||||
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import logging
|
|
||||||
from pathlib import Path
|
|
||||||
import plac
|
|
||||||
|
|
||||||
from bin.wiki_entity_linking import wikipedia_processor as wp, wikidata_processor as wd
|
|
||||||
from bin.wiki_entity_linking import wiki_io as io
|
|
||||||
from bin.wiki_entity_linking import kb_creator
|
|
||||||
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_FILE, ENTITY_DESCR_PATH, KB_MODEL_DIR, LOG_FORMAT
|
|
||||||
from bin.wiki_entity_linking import ENTITY_FREQ_PATH, PRIOR_PROB_PATH, ENTITY_DEFS_PATH, ENTITY_ALIAS_PATH
|
|
||||||
import spacy
|
|
||||||
from bin.wiki_entity_linking.kb_creator import read_kb
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
wd_json=("Path to the downloaded WikiData JSON dump.", "positional", None, Path),
|
|
||||||
wp_xml=("Path to the downloaded Wikipedia XML dump.", "positional", None, Path),
|
|
||||||
output_dir=("Output directory", "positional", None, Path),
|
|
||||||
model=("Model name or path, should include pretrained vectors.", "positional", None, str),
|
|
||||||
max_per_alias=("Max. # entities per alias (default 10)", "option", "a", int),
|
|
||||||
min_freq=("Min. count of an entity in the corpus (default 20)", "option", "f", int),
|
|
||||||
min_pair=("Min. count of entity-alias pairs (default 5)", "option", "c", int),
|
|
||||||
entity_vector_length=("Length of entity vectors (default 64)", "option", "v", int),
|
|
||||||
loc_prior_prob=("Location to file with prior probabilities", "option", "p", Path),
|
|
||||||
loc_entity_defs=("Location to file with entity definitions", "option", "d", Path),
|
|
||||||
loc_entity_desc=("Location to file with entity descriptions", "option", "s", Path),
|
|
||||||
descr_from_wp=("Flag for using descriptions from WP instead of WD (default False)", "flag", "wp"),
|
|
||||||
limit_prior=("Threshold to limit lines read from WP for prior probabilities", "option", "lp", int),
|
|
||||||
limit_train=("Threshold to limit lines read from WP for training set", "option", "lt", int),
|
|
||||||
limit_wd=("Threshold to limit lines read from WD", "option", "lw", int),
|
|
||||||
lang=("Optional language for which to get Wikidata titles. Defaults to 'en'", "option", "la", str),
|
|
||||||
)
|
|
||||||
def main(
|
|
||||||
wd_json,
|
|
||||||
wp_xml,
|
|
||||||
output_dir,
|
|
||||||
model,
|
|
||||||
max_per_alias=10,
|
|
||||||
min_freq=20,
|
|
||||||
min_pair=5,
|
|
||||||
entity_vector_length=64,
|
|
||||||
loc_prior_prob=None,
|
|
||||||
loc_entity_defs=None,
|
|
||||||
loc_entity_alias=None,
|
|
||||||
loc_entity_desc=None,
|
|
||||||
descr_from_wp=False,
|
|
||||||
limit_prior=None,
|
|
||||||
limit_train=None,
|
|
||||||
limit_wd=None,
|
|
||||||
lang="en",
|
|
||||||
):
|
|
||||||
entity_defs_path = loc_entity_defs if loc_entity_defs else output_dir / ENTITY_DEFS_PATH
|
|
||||||
entity_alias_path = loc_entity_alias if loc_entity_alias else output_dir / ENTITY_ALIAS_PATH
|
|
||||||
entity_descr_path = loc_entity_desc if loc_entity_desc else output_dir / ENTITY_DESCR_PATH
|
|
||||||
entity_freq_path = output_dir / ENTITY_FREQ_PATH
|
|
||||||
prior_prob_path = loc_prior_prob if loc_prior_prob else output_dir / PRIOR_PROB_PATH
|
|
||||||
training_entities_path = output_dir / TRAINING_DATA_FILE
|
|
||||||
kb_path = output_dir / KB_FILE
|
|
||||||
|
|
||||||
logger.info("Creating KB with Wikipedia and WikiData")
|
|
||||||
|
|
||||||
# STEP 0: set up IO
|
|
||||||
if not output_dir.exists():
|
|
||||||
output_dir.mkdir(parents=True)
|
|
||||||
|
|
||||||
# STEP 1: Load the NLP object
|
|
||||||
logger.info("STEP 1: Loading NLP model {}".format(model))
|
|
||||||
nlp = spacy.load(model)
|
|
||||||
|
|
||||||
# check the length of the nlp vectors
|
|
||||||
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
|
|
||||||
raise ValueError(
|
|
||||||
"The `nlp` object should have access to pretrained word vectors, "
|
|
||||||
" cf. https://spacy.io/usage/models#languages."
|
|
||||||
)
|
|
||||||
|
|
||||||
# STEP 2: create prior probabilities from WP
|
|
||||||
if not prior_prob_path.exists():
|
|
||||||
# It takes about 2h to process 1000M lines of Wikipedia XML dump
|
|
||||||
logger.info("STEP 2: Writing prior probabilities to {}".format(prior_prob_path))
|
|
||||||
if limit_prior is not None:
|
|
||||||
logger.warning("Warning: reading only {} lines of Wikipedia dump".format(limit_prior))
|
|
||||||
wp.read_prior_probs(wp_xml, prior_prob_path, limit=limit_prior)
|
|
||||||
else:
|
|
||||||
logger.info("STEP 2: Reading prior probabilities from {}".format(prior_prob_path))
|
|
||||||
|
|
||||||
# STEP 3: calculate entity frequencies
|
|
||||||
if not entity_freq_path.exists():
|
|
||||||
logger.info("STEP 3: Calculating and writing entity frequencies to {}".format(entity_freq_path))
|
|
||||||
io.write_entity_to_count(prior_prob_path, entity_freq_path)
|
|
||||||
else:
|
|
||||||
logger.info("STEP 3: Reading entity frequencies from {}".format(entity_freq_path))
|
|
||||||
|
|
||||||
# STEP 4: reading definitions and (possibly) descriptions from WikiData or from file
|
|
||||||
if (not entity_defs_path.exists()) or (not descr_from_wp and not entity_descr_path.exists()):
|
|
||||||
# It takes about 10h to process 55M lines of Wikidata JSON dump
|
|
||||||
logger.info("STEP 4: Parsing and writing Wikidata entity definitions to {}".format(entity_defs_path))
|
|
||||||
if limit_wd is not None:
|
|
||||||
logger.warning("Warning: reading only {} lines of Wikidata dump".format(limit_wd))
|
|
||||||
title_to_id, id_to_descr, id_to_alias = wd.read_wikidata_entities_json(
|
|
||||||
wd_json,
|
|
||||||
limit_wd,
|
|
||||||
to_print=False,
|
|
||||||
lang=lang,
|
|
||||||
parse_descr=(not descr_from_wp),
|
|
||||||
)
|
|
||||||
io.write_title_to_id(entity_defs_path, title_to_id)
|
|
||||||
|
|
||||||
logger.info("STEP 4b: Writing Wikidata entity aliases to {}".format(entity_alias_path))
|
|
||||||
io.write_id_to_alias(entity_alias_path, id_to_alias)
|
|
||||||
|
|
||||||
if not descr_from_wp:
|
|
||||||
logger.info("STEP 4c: Writing Wikidata entity descriptions to {}".format(entity_descr_path))
|
|
||||||
io.write_id_to_descr(entity_descr_path, id_to_descr)
|
|
||||||
else:
|
|
||||||
logger.info("STEP 4: Reading entity definitions from {}".format(entity_defs_path))
|
|
||||||
logger.info("STEP 4b: Reading entity aliases from {}".format(entity_alias_path))
|
|
||||||
if not descr_from_wp:
|
|
||||||
logger.info("STEP 4c: Reading entity descriptions from {}".format(entity_descr_path))
|
|
||||||
|
|
||||||
# STEP 5: Getting gold entities from Wikipedia
|
|
||||||
if (not training_entities_path.exists()) or (descr_from_wp and not entity_descr_path.exists()):
|
|
||||||
logger.info("STEP 5: Parsing and writing Wikipedia gold entities to {}".format(training_entities_path))
|
|
||||||
if limit_train is not None:
|
|
||||||
logger.warning("Warning: reading only {} lines of Wikipedia dump".format(limit_train))
|
|
||||||
wp.create_training_and_desc(wp_xml, entity_defs_path, entity_descr_path,
|
|
||||||
training_entities_path, descr_from_wp, limit_train)
|
|
||||||
if descr_from_wp:
|
|
||||||
logger.info("STEP 5b: Parsing and writing Wikipedia descriptions to {}".format(entity_descr_path))
|
|
||||||
else:
|
|
||||||
logger.info("STEP 5: Reading gold entities from {}".format(training_entities_path))
|
|
||||||
if descr_from_wp:
|
|
||||||
logger.info("STEP 5b: Reading entity descriptions from {}".format(entity_descr_path))
|
|
||||||
|
|
||||||
# STEP 6: creating the actual KB
|
|
||||||
# It takes ca. 30 minutes to pretrain the entity embeddings
|
|
||||||
if not kb_path.exists():
|
|
||||||
logger.info("STEP 6: Creating the KB at {}".format(kb_path))
|
|
||||||
kb = kb_creator.create_kb(
|
|
||||||
nlp=nlp,
|
|
||||||
max_entities_per_alias=max_per_alias,
|
|
||||||
min_entity_freq=min_freq,
|
|
||||||
min_occ=min_pair,
|
|
||||||
entity_def_path=entity_defs_path,
|
|
||||||
entity_descr_path=entity_descr_path,
|
|
||||||
entity_alias_path=entity_alias_path,
|
|
||||||
entity_freq_path=entity_freq_path,
|
|
||||||
prior_prob_path=prior_prob_path,
|
|
||||||
entity_vector_length=entity_vector_length,
|
|
||||||
)
|
|
||||||
kb.dump(kb_path)
|
|
||||||
logger.info("kb entities: {}".format(kb.get_size_entities()))
|
|
||||||
logger.info("kb aliases: {}".format(kb.get_size_aliases()))
|
|
||||||
nlp.to_disk(output_dir / KB_MODEL_DIR)
|
|
||||||
else:
|
|
||||||
logger.info("STEP 6: KB already exists at {}".format(kb_path))
|
|
||||||
|
|
||||||
logger.info("Done!")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
|
|
||||||
plac.call(main)
|
|
|
@ -1,154 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import bz2
|
|
||||||
import json
|
|
||||||
import logging
|
|
||||||
|
|
||||||
from bin.wiki_entity_linking.wiki_namespaces import WD_META_ITEMS
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
|
|
||||||
def read_wikidata_entities_json(wikidata_file, limit=None, to_print=False, lang="en", parse_descr=True):
|
|
||||||
# Read the JSON wiki data and parse out the entities. Takes about 7-10h to parse 55M lines.
|
|
||||||
# get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
|
|
||||||
|
|
||||||
site_filter = '{}wiki'.format(lang)
|
|
||||||
|
|
||||||
# filter: currently defined as OR: one hit suffices to be removed from further processing
|
|
||||||
exclude_list = WD_META_ITEMS
|
|
||||||
|
|
||||||
# punctuation
|
|
||||||
exclude_list.extend(["Q1383557", "Q10617810"])
|
|
||||||
|
|
||||||
# letters etc
|
|
||||||
exclude_list.extend(["Q188725", "Q19776628", "Q3841820", "Q17907810", "Q9788", "Q9398093"])
|
|
||||||
|
|
||||||
neg_prop_filter = {
|
|
||||||
'P31': exclude_list, # instance of
|
|
||||||
'P279': exclude_list # subclass
|
|
||||||
}
|
|
||||||
|
|
||||||
title_to_id = dict()
|
|
||||||
id_to_descr = dict()
|
|
||||||
id_to_alias = dict()
|
|
||||||
|
|
||||||
# parse appropriate fields - depending on what we need in the KB
|
|
||||||
parse_properties = False
|
|
||||||
parse_sitelinks = True
|
|
||||||
parse_labels = False
|
|
||||||
parse_aliases = True
|
|
||||||
parse_claims = True
|
|
||||||
|
|
||||||
with bz2.open(wikidata_file, mode='rb') as file:
|
|
||||||
for cnt, line in enumerate(file):
|
|
||||||
if limit and cnt >= limit:
|
|
||||||
break
|
|
||||||
if cnt % 500000 == 0 and cnt > 0:
|
|
||||||
logger.info("processed {} lines of WikiData JSON dump".format(cnt))
|
|
||||||
clean_line = line.strip()
|
|
||||||
if clean_line.endswith(b","):
|
|
||||||
clean_line = clean_line[:-1]
|
|
||||||
if len(clean_line) > 1:
|
|
||||||
obj = json.loads(clean_line)
|
|
||||||
entry_type = obj["type"]
|
|
||||||
|
|
||||||
if entry_type == "item":
|
|
||||||
keep = True
|
|
||||||
|
|
||||||
claims = obj["claims"]
|
|
||||||
if parse_claims:
|
|
||||||
for prop, value_set in neg_prop_filter.items():
|
|
||||||
claim_property = claims.get(prop, None)
|
|
||||||
if claim_property:
|
|
||||||
for cp in claim_property:
|
|
||||||
cp_id = (
|
|
||||||
cp["mainsnak"]
|
|
||||||
.get("datavalue", {})
|
|
||||||
.get("value", {})
|
|
||||||
.get("id")
|
|
||||||
)
|
|
||||||
cp_rank = cp["rank"]
|
|
||||||
if cp_rank != "deprecated" and cp_id in value_set:
|
|
||||||
keep = False
|
|
||||||
|
|
||||||
if keep:
|
|
||||||
unique_id = obj["id"]
|
|
||||||
|
|
||||||
if to_print:
|
|
||||||
print("ID:", unique_id)
|
|
||||||
print("type:", entry_type)
|
|
||||||
|
|
||||||
# parsing all properties that refer to other entities
|
|
||||||
if parse_properties:
|
|
||||||
for prop, claim_property in claims.items():
|
|
||||||
cp_dicts = [
|
|
||||||
cp["mainsnak"]["datavalue"].get("value")
|
|
||||||
for cp in claim_property
|
|
||||||
if cp["mainsnak"].get("datavalue")
|
|
||||||
]
|
|
||||||
cp_values = [
|
|
||||||
cp_dict.get("id")
|
|
||||||
for cp_dict in cp_dicts
|
|
||||||
if isinstance(cp_dict, dict)
|
|
||||||
if cp_dict.get("id") is not None
|
|
||||||
]
|
|
||||||
if cp_values:
|
|
||||||
if to_print:
|
|
||||||
print("prop:", prop, cp_values)
|
|
||||||
|
|
||||||
found_link = False
|
|
||||||
if parse_sitelinks:
|
|
||||||
site_value = obj["sitelinks"].get(site_filter, None)
|
|
||||||
if site_value:
|
|
||||||
site = site_value["title"]
|
|
||||||
if to_print:
|
|
||||||
print(site_filter, ":", site)
|
|
||||||
title_to_id[site] = unique_id
|
|
||||||
found_link = True
|
|
||||||
|
|
||||||
if parse_labels:
|
|
||||||
labels = obj["labels"]
|
|
||||||
if labels:
|
|
||||||
lang_label = labels.get(lang, None)
|
|
||||||
if lang_label:
|
|
||||||
if to_print:
|
|
||||||
print(
|
|
||||||
"label (" + lang + "):", lang_label["value"]
|
|
||||||
)
|
|
||||||
|
|
||||||
if found_link and parse_descr:
|
|
||||||
descriptions = obj["descriptions"]
|
|
||||||
if descriptions:
|
|
||||||
lang_descr = descriptions.get(lang, None)
|
|
||||||
if lang_descr:
|
|
||||||
if to_print:
|
|
||||||
print(
|
|
||||||
"description (" + lang + "):",
|
|
||||||
lang_descr["value"],
|
|
||||||
)
|
|
||||||
id_to_descr[unique_id] = lang_descr["value"]
|
|
||||||
|
|
||||||
if parse_aliases:
|
|
||||||
aliases = obj["aliases"]
|
|
||||||
if aliases:
|
|
||||||
lang_aliases = aliases.get(lang, None)
|
|
||||||
if lang_aliases:
|
|
||||||
for item in lang_aliases:
|
|
||||||
if to_print:
|
|
||||||
print(
|
|
||||||
"alias (" + lang + "):", item["value"]
|
|
||||||
)
|
|
||||||
alias_list = id_to_alias.get(unique_id, [])
|
|
||||||
alias_list.append(item["value"])
|
|
||||||
id_to_alias[unique_id] = alias_list
|
|
||||||
|
|
||||||
if to_print:
|
|
||||||
print()
|
|
||||||
|
|
||||||
# log final number of lines processed
|
|
||||||
logger.info("Finished. Processed {} lines of WikiData JSON dump".format(cnt))
|
|
||||||
return title_to_id, id_to_descr, id_to_alias
|
|
||||||
|
|
||||||
|
|
|
@ -1,230 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
"""Script that takes a previously created Knowledge Base and trains an entity linking
|
|
||||||
pipeline. The provided KB directory should hold the kb, the original nlp object and
|
|
||||||
its vocab used to create the KB, and a few auxiliary files such as the entity definitions,
|
|
||||||
as created by the script `wikidata_create_kb`.
|
|
||||||
|
|
||||||
For the Wikipedia dump: get enwiki-latest-pages-articles-multistream.xml.bz2
|
|
||||||
from https://dumps.wikimedia.org/enwiki/latest/
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import random
|
|
||||||
import logging
|
|
||||||
import spacy
|
|
||||||
from pathlib import Path
|
|
||||||
import plac
|
|
||||||
from tqdm import tqdm
|
|
||||||
|
|
||||||
from bin.wiki_entity_linking import wikipedia_processor
|
|
||||||
from bin.wiki_entity_linking import (
|
|
||||||
TRAINING_DATA_FILE,
|
|
||||||
KB_MODEL_DIR,
|
|
||||||
KB_FILE,
|
|
||||||
LOG_FORMAT,
|
|
||||||
OUTPUT_MODEL_DIR,
|
|
||||||
)
|
|
||||||
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
|
|
||||||
from bin.wiki_entity_linking.kb_creator import read_kb
|
|
||||||
|
|
||||||
from spacy.util import minibatch, compounding
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
dir_kb=("Directory with KB, NLP and related files", "positional", None, Path),
|
|
||||||
output_dir=("Output directory", "option", "o", Path),
|
|
||||||
loc_training=("Location to training data", "option", "k", Path),
|
|
||||||
epochs=("Number of training iterations (default 10)", "option", "e", int),
|
|
||||||
dropout=("Dropout to prevent overfitting (default 0.5)", "option", "p", float),
|
|
||||||
lr=("Learning rate (default 0.005)", "option", "n", float),
|
|
||||||
l2=("L2 regularization", "option", "r", float),
|
|
||||||
train_articles=("# training articles (default 90% of all)", "option", "t", int),
|
|
||||||
dev_articles=("# dev test articles (default 10% of all)", "option", "d", int),
|
|
||||||
labels_discard=("NER labels to discard (default None)", "option", "l", str),
|
|
||||||
)
|
|
||||||
def main(
|
|
||||||
dir_kb,
|
|
||||||
output_dir=None,
|
|
||||||
loc_training=None,
|
|
||||||
epochs=10,
|
|
||||||
dropout=0.5,
|
|
||||||
lr=0.005,
|
|
||||||
l2=1e-6,
|
|
||||||
train_articles=None,
|
|
||||||
dev_articles=None,
|
|
||||||
labels_discard=None,
|
|
||||||
):
|
|
||||||
if not output_dir:
|
|
||||||
logger.warning(
|
|
||||||
"No output dir specified so no results will be written, are you sure about this ?"
|
|
||||||
)
|
|
||||||
|
|
||||||
logger.info("Creating Entity Linker with Wikipedia and WikiData")
|
|
||||||
|
|
||||||
output_dir = Path(output_dir) if output_dir else dir_kb
|
|
||||||
training_path = loc_training if loc_training else dir_kb / TRAINING_DATA_FILE
|
|
||||||
nlp_dir = dir_kb / KB_MODEL_DIR
|
|
||||||
kb_path = dir_kb / KB_FILE
|
|
||||||
nlp_output_dir = output_dir / OUTPUT_MODEL_DIR
|
|
||||||
|
|
||||||
# STEP 0: set up IO
|
|
||||||
if not output_dir.exists():
|
|
||||||
output_dir.mkdir()
|
|
||||||
|
|
||||||
# STEP 1 : load the NLP object
|
|
||||||
logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
|
|
||||||
nlp = spacy.load(nlp_dir)
|
|
||||||
logger.info(
|
|
||||||
"Original NLP pipeline has following pipeline components: {}".format(
|
|
||||||
nlp.pipe_names
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
# check that there is a NER component in the pipeline
|
|
||||||
if "ner" not in nlp.pipe_names:
|
|
||||||
raise ValueError("The `nlp` object should have a pretrained `ner` component.")
|
|
||||||
|
|
||||||
logger.info("STEP 1b: Loading KB from {}".format(kb_path))
|
|
||||||
kb = read_kb(nlp, kb_path)
|
|
||||||
|
|
||||||
# STEP 2: read the training dataset previously created from WP
|
|
||||||
logger.info("STEP 2: Reading training & dev dataset from {}".format(training_path))
|
|
||||||
train_indices, dev_indices = wikipedia_processor.read_training_indices(
|
|
||||||
training_path
|
|
||||||
)
|
|
||||||
logger.info(
|
|
||||||
"Training set has {} articles, limit set to roughly {} articles per epoch".format(
|
|
||||||
len(train_indices), train_articles if train_articles else "all"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
logger.info(
|
|
||||||
"Dev set has {} articles, limit set to rougly {} articles for evaluation".format(
|
|
||||||
len(dev_indices), dev_articles if dev_articles else "all"
|
|
||||||
)
|
|
||||||
)
|
|
||||||
if dev_articles:
|
|
||||||
dev_indices = dev_indices[0:dev_articles]
|
|
||||||
|
|
||||||
# STEP 3: create and train an entity linking pipe
|
|
||||||
logger.info(
|
|
||||||
"STEP 3: Creating and training an Entity Linking pipe for {} epochs".format(
|
|
||||||
epochs
|
|
||||||
)
|
|
||||||
)
|
|
||||||
if labels_discard:
|
|
||||||
labels_discard = [x.strip() for x in labels_discard.split(",")]
|
|
||||||
logger.info(
|
|
||||||
"Discarding {} NER types: {}".format(len(labels_discard), labels_discard)
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
labels_discard = []
|
|
||||||
|
|
||||||
el_pipe = nlp.create_pipe(
|
|
||||||
name="entity_linker",
|
|
||||||
config={
|
|
||||||
"pretrained_vectors": nlp.vocab.vectors,
|
|
||||||
"labels_discard": labels_discard,
|
|
||||||
},
|
|
||||||
)
|
|
||||||
el_pipe.set_kb(kb)
|
|
||||||
nlp.add_pipe(el_pipe, last=True)
|
|
||||||
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
|
|
||||||
with nlp.disable_pipes(*other_pipes): # only train Entity Linking
|
|
||||||
optimizer = nlp.begin_training()
|
|
||||||
optimizer.learn_rate = lr
|
|
||||||
optimizer.L2 = l2
|
|
||||||
|
|
||||||
logger.info("Dev Baseline Accuracies:")
|
|
||||||
dev_data = wikipedia_processor.read_el_docs_golds(
|
|
||||||
nlp=nlp,
|
|
||||||
entity_file_path=training_path,
|
|
||||||
dev=True,
|
|
||||||
line_ids=dev_indices,
|
|
||||||
kb=kb,
|
|
||||||
labels_discard=labels_discard,
|
|
||||||
)
|
|
||||||
|
|
||||||
measure_performance(
|
|
||||||
dev_data, kb, el_pipe, baseline=True, context=False, dev_limit=len(dev_indices)
|
|
||||||
)
|
|
||||||
|
|
||||||
for itn in range(epochs):
|
|
||||||
random.shuffle(train_indices)
|
|
||||||
losses = {}
|
|
||||||
batches = minibatch(train_indices, size=compounding(8.0, 128.0, 1.001))
|
|
||||||
batchnr = 0
|
|
||||||
articles_processed = 0
|
|
||||||
|
|
||||||
# we either process the whole training file, or just a part each epoch
|
|
||||||
bar_total = len(train_indices)
|
|
||||||
if train_articles:
|
|
||||||
bar_total = train_articles
|
|
||||||
|
|
||||||
with tqdm(total=bar_total, leave=False, desc=f"Epoch {itn}") as pbar:
|
|
||||||
for batch in batches:
|
|
||||||
if not train_articles or articles_processed < train_articles:
|
|
||||||
with nlp.disable_pipes("entity_linker"):
|
|
||||||
train_batch = wikipedia_processor.read_el_docs_golds(
|
|
||||||
nlp=nlp,
|
|
||||||
entity_file_path=training_path,
|
|
||||||
dev=False,
|
|
||||||
line_ids=batch,
|
|
||||||
kb=kb,
|
|
||||||
labels_discard=labels_discard,
|
|
||||||
)
|
|
||||||
try:
|
|
||||||
with nlp.disable_pipes(*other_pipes):
|
|
||||||
nlp.update(
|
|
||||||
examples=train_batch,
|
|
||||||
sgd=optimizer,
|
|
||||||
drop=dropout,
|
|
||||||
losses=losses,
|
|
||||||
)
|
|
||||||
batchnr += 1
|
|
||||||
articles_processed += len(docs)
|
|
||||||
pbar.update(len(docs))
|
|
||||||
except Exception as e:
|
|
||||||
logger.error("Error updating batch:" + str(e))
|
|
||||||
if batchnr > 0:
|
|
||||||
logging.info(
|
|
||||||
"Epoch {} trained on {} articles, train loss {}".format(
|
|
||||||
itn, articles_processed, round(losses["entity_linker"] / batchnr, 2)
|
|
||||||
)
|
|
||||||
)
|
|
||||||
# re-read the dev_data (data is returned as a generator)
|
|
||||||
dev_data = wikipedia_processor.read_el_docs_golds(
|
|
||||||
nlp=nlp,
|
|
||||||
entity_file_path=training_path,
|
|
||||||
dev=True,
|
|
||||||
line_ids=dev_indices,
|
|
||||||
kb=kb,
|
|
||||||
labels_discard=labels_discard,
|
|
||||||
)
|
|
||||||
measure_performance(
|
|
||||||
dev_data,
|
|
||||||
kb,
|
|
||||||
el_pipe,
|
|
||||||
baseline=False,
|
|
||||||
context=True,
|
|
||||||
dev_limit=len(dev_indices),
|
|
||||||
)
|
|
||||||
|
|
||||||
if output_dir:
|
|
||||||
# STEP 4: write the NLP pipeline (now including an EL model) to file
|
|
||||||
logger.info(
|
|
||||||
"Final NLP pipeline has following pipeline components: {}".format(
|
|
||||||
nlp.pipe_names
|
|
||||||
)
|
|
||||||
)
|
|
||||||
logger.info("STEP 4: Writing trained NLP to {}".format(nlp_output_dir))
|
|
||||||
nlp.to_disk(nlp_output_dir)
|
|
||||||
|
|
||||||
logger.info("Done!")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
|
|
||||||
plac.call(main)
|
|
|
@ -1,565 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import re
|
|
||||||
import bz2
|
|
||||||
import logging
|
|
||||||
import random
|
|
||||||
import json
|
|
||||||
|
|
||||||
from spacy.gold import GoldParse
|
|
||||||
from bin.wiki_entity_linking import wiki_io as io
|
|
||||||
from bin.wiki_entity_linking.wiki_namespaces import (
|
|
||||||
WP_META_NAMESPACE,
|
|
||||||
WP_FILE_NAMESPACE,
|
|
||||||
WP_CATEGORY_NAMESPACE,
|
|
||||||
)
|
|
||||||
|
|
||||||
"""
|
|
||||||
Process a Wikipedia dump to calculate entity frequencies and prior probabilities in combination with certain mentions.
|
|
||||||
Write these results to file for downstream KB and training data generation.
|
|
||||||
|
|
||||||
Process Wikipedia interlinks to generate a training dataset for the EL algorithm.
|
|
||||||
"""
|
|
||||||
|
|
||||||
ENTITY_FILE = "gold_entities.csv"
|
|
||||||
|
|
||||||
map_alias_to_link = dict()
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
title_regex = re.compile(r"(?<=<title>).*(?=</title>)")
|
|
||||||
id_regex = re.compile(r"(?<=<id>)\d*(?=</id>)")
|
|
||||||
text_regex = re.compile(r"(?<=<text xml:space=\"preserve\">).*(?=</text)")
|
|
||||||
info_regex = re.compile(r"{[^{]*?}")
|
|
||||||
html_regex = re.compile(r"<!--[^-]*-->")
|
|
||||||
ref_regex = re.compile(r"<ref.*?>") # non-greedy
|
|
||||||
ref_2_regex = re.compile(r"</ref.*?>") # non-greedy
|
|
||||||
|
|
||||||
# find the links
|
|
||||||
link_regex = re.compile(r"\[\[[^\[\]]*\]\]")
|
|
||||||
|
|
||||||
# match on interwiki links, e.g. `en:` or `:fr:`
|
|
||||||
ns_regex = r":?" + "[a-z][a-z]" + ":"
|
|
||||||
# match on Namespace: optionally preceded by a :
|
|
||||||
for ns in WP_META_NAMESPACE:
|
|
||||||
ns_regex += "|" + ":?" + ns + ":"
|
|
||||||
ns_regex = re.compile(ns_regex, re.IGNORECASE)
|
|
||||||
|
|
||||||
files = r""
|
|
||||||
for f in WP_FILE_NAMESPACE:
|
|
||||||
files += "\[\[" + f + ":[^[\]]+]]" + "|"
|
|
||||||
files = files[0 : len(files) - 1]
|
|
||||||
file_regex = re.compile(files)
|
|
||||||
|
|
||||||
cats = r""
|
|
||||||
for c in WP_CATEGORY_NAMESPACE:
|
|
||||||
cats += "\[\[" + c + ":[^\[]*]]" + "|"
|
|
||||||
cats = cats[0 : len(cats) - 1]
|
|
||||||
category_regex = re.compile(cats)
|
|
||||||
|
|
||||||
|
|
||||||
def read_prior_probs(wikipedia_input, prior_prob_output, limit=None):
|
|
||||||
"""
|
|
||||||
Read the XML wikipedia data and parse out intra-wiki links to estimate prior probabilities.
|
|
||||||
The full file takes about 2-3h to parse 1100M lines.
|
|
||||||
It works relatively fast because it runs line by line, irrelevant of which article the intrawiki is from,
|
|
||||||
though dev test articles are excluded in order not to get an artificially strong baseline.
|
|
||||||
"""
|
|
||||||
cnt = 0
|
|
||||||
read_id = False
|
|
||||||
current_article_id = None
|
|
||||||
with bz2.open(wikipedia_input, mode="rb") as file:
|
|
||||||
line = file.readline()
|
|
||||||
while line and (not limit or cnt < limit):
|
|
||||||
if cnt % 25000000 == 0 and cnt > 0:
|
|
||||||
logger.info("processed {} lines of Wikipedia XML dump".format(cnt))
|
|
||||||
clean_line = line.strip().decode("utf-8")
|
|
||||||
|
|
||||||
# we attempt at reading the article's ID (but not the revision or contributor ID)
|
|
||||||
if "<revision>" in clean_line or "<contributor>" in clean_line:
|
|
||||||
read_id = False
|
|
||||||
if "<page>" in clean_line:
|
|
||||||
read_id = True
|
|
||||||
|
|
||||||
if read_id:
|
|
||||||
ids = id_regex.search(clean_line)
|
|
||||||
if ids:
|
|
||||||
current_article_id = ids[0]
|
|
||||||
|
|
||||||
# only processing prior probabilities from true training (non-dev) articles
|
|
||||||
if not is_dev(current_article_id):
|
|
||||||
aliases, entities, normalizations = get_wp_links(clean_line)
|
|
||||||
for alias, entity, norm in zip(aliases, entities, normalizations):
|
|
||||||
_store_alias(
|
|
||||||
alias, entity, normalize_alias=norm, normalize_entity=True
|
|
||||||
)
|
|
||||||
|
|
||||||
line = file.readline()
|
|
||||||
cnt += 1
|
|
||||||
logger.info("processed {} lines of Wikipedia XML dump".format(cnt))
|
|
||||||
logger.info("Finished. processed {} lines of Wikipedia XML dump".format(cnt))
|
|
||||||
|
|
||||||
# write all aliases and their entities and count occurrences to file
|
|
||||||
with prior_prob_output.open("w", encoding="utf8") as outputfile:
|
|
||||||
outputfile.write("alias" + "|" + "count" + "|" + "entity" + "\n")
|
|
||||||
for alias, alias_dict in sorted(map_alias_to_link.items(), key=lambda x: x[0]):
|
|
||||||
s_dict = sorted(alias_dict.items(), key=lambda x: x[1], reverse=True)
|
|
||||||
for entity, count in s_dict:
|
|
||||||
outputfile.write(alias + "|" + str(count) + "|" + entity + "\n")
|
|
||||||
|
|
||||||
|
|
||||||
def _store_alias(alias, entity, normalize_alias=False, normalize_entity=True):
|
|
||||||
alias = alias.strip()
|
|
||||||
entity = entity.strip()
|
|
||||||
|
|
||||||
# remove everything after # as this is not part of the title but refers to a specific paragraph
|
|
||||||
if normalize_entity:
|
|
||||||
# wikipedia titles are always capitalized
|
|
||||||
entity = _capitalize_first(entity.split("#")[0])
|
|
||||||
if normalize_alias:
|
|
||||||
alias = alias.split("#")[0]
|
|
||||||
|
|
||||||
if alias and entity:
|
|
||||||
alias_dict = map_alias_to_link.get(alias, dict())
|
|
||||||
entity_count = alias_dict.get(entity, 0)
|
|
||||||
alias_dict[entity] = entity_count + 1
|
|
||||||
map_alias_to_link[alias] = alias_dict
|
|
||||||
|
|
||||||
|
|
||||||
def get_wp_links(text):
|
|
||||||
aliases = []
|
|
||||||
entities = []
|
|
||||||
normalizations = []
|
|
||||||
|
|
||||||
matches = link_regex.findall(text)
|
|
||||||
for match in matches:
|
|
||||||
match = match[2:][:-2].replace("_", " ").strip()
|
|
||||||
|
|
||||||
if ns_regex.match(match):
|
|
||||||
pass # ignore the entity if it points to a "meta" page
|
|
||||||
|
|
||||||
# this is a simple [[link]], with the alias the same as the mention
|
|
||||||
elif "|" not in match:
|
|
||||||
aliases.append(match)
|
|
||||||
entities.append(match)
|
|
||||||
normalizations.append(True)
|
|
||||||
|
|
||||||
# in wiki format, the link is written as [[entity|alias]]
|
|
||||||
else:
|
|
||||||
splits = match.split("|")
|
|
||||||
entity = splits[0].strip()
|
|
||||||
alias = splits[1].strip()
|
|
||||||
# specific wiki format [[alias (specification)|]]
|
|
||||||
if len(alias) == 0 and "(" in entity:
|
|
||||||
alias = entity.split("(")[0]
|
|
||||||
aliases.append(alias)
|
|
||||||
entities.append(entity)
|
|
||||||
normalizations.append(False)
|
|
||||||
else:
|
|
||||||
aliases.append(alias)
|
|
||||||
entities.append(entity)
|
|
||||||
normalizations.append(False)
|
|
||||||
|
|
||||||
return aliases, entities, normalizations
|
|
||||||
|
|
||||||
|
|
||||||
def _capitalize_first(text):
|
|
||||||
if not text:
|
|
||||||
return None
|
|
||||||
result = text[0].capitalize()
|
|
||||||
if len(result) > 0:
|
|
||||||
result += text[1:]
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def create_training_and_desc(
|
|
||||||
wp_input, def_input, desc_output, training_output, parse_desc, limit=None
|
|
||||||
):
|
|
||||||
wp_to_id = io.read_title_to_id(def_input)
|
|
||||||
_process_wikipedia_texts(
|
|
||||||
wp_input, wp_to_id, desc_output, training_output, parse_desc, limit
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def _process_wikipedia_texts(
|
|
||||||
wikipedia_input, wp_to_id, output, training_output, parse_descriptions, limit=None
|
|
||||||
):
|
|
||||||
"""
|
|
||||||
Read the XML wikipedia data to parse out training data:
|
|
||||||
raw text data + positive instances
|
|
||||||
"""
|
|
||||||
|
|
||||||
read_ids = set()
|
|
||||||
|
|
||||||
with output.open("a", encoding="utf8") as descr_file, training_output.open(
|
|
||||||
"w", encoding="utf8"
|
|
||||||
) as entity_file:
|
|
||||||
if parse_descriptions:
|
|
||||||
_write_training_description(descr_file, "WD_id", "description")
|
|
||||||
with bz2.open(wikipedia_input, mode="rb") as file:
|
|
||||||
article_count = 0
|
|
||||||
article_text = ""
|
|
||||||
article_title = None
|
|
||||||
article_id = None
|
|
||||||
reading_text = False
|
|
||||||
reading_revision = False
|
|
||||||
|
|
||||||
for line in file:
|
|
||||||
clean_line = line.strip().decode("utf-8")
|
|
||||||
|
|
||||||
if clean_line == "<revision>":
|
|
||||||
reading_revision = True
|
|
||||||
elif clean_line == "</revision>":
|
|
||||||
reading_revision = False
|
|
||||||
|
|
||||||
# Start reading new page
|
|
||||||
if clean_line == "<page>":
|
|
||||||
article_text = ""
|
|
||||||
article_title = None
|
|
||||||
article_id = None
|
|
||||||
# finished reading this page
|
|
||||||
elif clean_line == "</page>":
|
|
||||||
if article_id:
|
|
||||||
clean_text, entities = _process_wp_text(
|
|
||||||
article_title, article_text, wp_to_id
|
|
||||||
)
|
|
||||||
if clean_text is not None and entities is not None:
|
|
||||||
_write_training_entities(
|
|
||||||
entity_file, article_id, clean_text, entities
|
|
||||||
)
|
|
||||||
|
|
||||||
if article_title in wp_to_id and parse_descriptions:
|
|
||||||
description = " ".join(
|
|
||||||
clean_text[:1000].split(" ")[:-1]
|
|
||||||
)
|
|
||||||
_write_training_description(
|
|
||||||
descr_file, wp_to_id[article_title], description
|
|
||||||
)
|
|
||||||
article_count += 1
|
|
||||||
if article_count % 10000 == 0 and article_count > 0:
|
|
||||||
logger.info(
|
|
||||||
"Processed {} articles".format(article_count)
|
|
||||||
)
|
|
||||||
if limit and article_count >= limit:
|
|
||||||
break
|
|
||||||
article_text = ""
|
|
||||||
article_title = None
|
|
||||||
article_id = None
|
|
||||||
reading_text = False
|
|
||||||
reading_revision = False
|
|
||||||
|
|
||||||
# start reading text within a page
|
|
||||||
if "<text" in clean_line:
|
|
||||||
reading_text = True
|
|
||||||
|
|
||||||
if reading_text:
|
|
||||||
article_text += " " + clean_line
|
|
||||||
|
|
||||||
# stop reading text within a page (we assume a new page doesn't start on the same line)
|
|
||||||
if "</text" in clean_line:
|
|
||||||
reading_text = False
|
|
||||||
|
|
||||||
# read the ID of this article (outside the revision portion of the document)
|
|
||||||
if not reading_revision:
|
|
||||||
ids = id_regex.search(clean_line)
|
|
||||||
if ids:
|
|
||||||
article_id = ids[0]
|
|
||||||
if article_id in read_ids:
|
|
||||||
logger.info(
|
|
||||||
"Found duplicate article ID", article_id, clean_line
|
|
||||||
) # This should never happen ...
|
|
||||||
read_ids.add(article_id)
|
|
||||||
|
|
||||||
# read the title of this article (outside the revision portion of the document)
|
|
||||||
if not reading_revision:
|
|
||||||
titles = title_regex.search(clean_line)
|
|
||||||
if titles:
|
|
||||||
article_title = titles[0].strip()
|
|
||||||
logger.info("Finished. Processed {} articles".format(article_count))
|
|
||||||
|
|
||||||
|
|
||||||
def _process_wp_text(article_title, article_text, wp_to_id):
|
|
||||||
# ignore meta Wikipedia pages
|
|
||||||
if ns_regex.match(article_title):
|
|
||||||
return None, None
|
|
||||||
|
|
||||||
# remove the text tags
|
|
||||||
text_search = text_regex.search(article_text)
|
|
||||||
if text_search is None:
|
|
||||||
return None, None
|
|
||||||
text = text_search.group(0)
|
|
||||||
|
|
||||||
# stop processing if this is a redirect page
|
|
||||||
if text.startswith("#REDIRECT"):
|
|
||||||
return None, None
|
|
||||||
|
|
||||||
# get the raw text without markup etc, keeping only interwiki links
|
|
||||||
clean_text, entities = _remove_links(_get_clean_wp_text(text), wp_to_id)
|
|
||||||
return clean_text, entities
|
|
||||||
|
|
||||||
|
|
||||||
def _get_clean_wp_text(article_text):
|
|
||||||
clean_text = article_text.strip()
|
|
||||||
|
|
||||||
# remove bolding & italic markup
|
|
||||||
clean_text = clean_text.replace("'''", "")
|
|
||||||
clean_text = clean_text.replace("''", "")
|
|
||||||
|
|
||||||
# remove nested {{info}} statements by removing the inner/smallest ones first and iterating
|
|
||||||
try_again = True
|
|
||||||
previous_length = len(clean_text)
|
|
||||||
while try_again:
|
|
||||||
clean_text = info_regex.sub(
|
|
||||||
"", clean_text
|
|
||||||
) # non-greedy match excluding a nested {
|
|
||||||
if len(clean_text) < previous_length:
|
|
||||||
try_again = True
|
|
||||||
else:
|
|
||||||
try_again = False
|
|
||||||
previous_length = len(clean_text)
|
|
||||||
|
|
||||||
# remove HTML comments
|
|
||||||
clean_text = html_regex.sub("", clean_text)
|
|
||||||
|
|
||||||
# remove Category and File statements
|
|
||||||
clean_text = category_regex.sub("", clean_text)
|
|
||||||
clean_text = file_regex.sub("", clean_text)
|
|
||||||
|
|
||||||
# remove multiple =
|
|
||||||
while "==" in clean_text:
|
|
||||||
clean_text = clean_text.replace("==", "=")
|
|
||||||
|
|
||||||
clean_text = clean_text.replace(". =", ".")
|
|
||||||
clean_text = clean_text.replace(" = ", ". ")
|
|
||||||
clean_text = clean_text.replace("= ", ".")
|
|
||||||
clean_text = clean_text.replace(" =", "")
|
|
||||||
|
|
||||||
# remove refs (non-greedy match)
|
|
||||||
clean_text = ref_regex.sub("", clean_text)
|
|
||||||
clean_text = ref_2_regex.sub("", clean_text)
|
|
||||||
|
|
||||||
# remove additional wikiformatting
|
|
||||||
clean_text = re.sub(r"<blockquote>", "", clean_text)
|
|
||||||
clean_text = re.sub(r"</blockquote>", "", clean_text)
|
|
||||||
|
|
||||||
# change special characters back to normal ones
|
|
||||||
clean_text = clean_text.replace(r"<", "<")
|
|
||||||
clean_text = clean_text.replace(r">", ">")
|
|
||||||
clean_text = clean_text.replace(r""", '"')
|
|
||||||
clean_text = clean_text.replace(r"&nbsp;", " ")
|
|
||||||
clean_text = clean_text.replace(r"&", "&")
|
|
||||||
|
|
||||||
# remove multiple spaces
|
|
||||||
while " " in clean_text:
|
|
||||||
clean_text = clean_text.replace(" ", " ")
|
|
||||||
|
|
||||||
return clean_text.strip()
|
|
||||||
|
|
||||||
|
|
||||||
def _remove_links(clean_text, wp_to_id):
|
|
||||||
# read the text char by char to get the right offsets for the interwiki links
|
|
||||||
entities = []
|
|
||||||
final_text = ""
|
|
||||||
open_read = 0
|
|
||||||
reading_text = True
|
|
||||||
reading_entity = False
|
|
||||||
reading_mention = False
|
|
||||||
reading_special_case = False
|
|
||||||
entity_buffer = ""
|
|
||||||
mention_buffer = ""
|
|
||||||
for index, letter in enumerate(clean_text):
|
|
||||||
if letter == "[":
|
|
||||||
open_read += 1
|
|
||||||
elif letter == "]":
|
|
||||||
open_read -= 1
|
|
||||||
elif letter == "|":
|
|
||||||
if reading_text:
|
|
||||||
final_text += letter
|
|
||||||
# switch from reading entity to mention in the [[entity|mention]] pattern
|
|
||||||
elif reading_entity:
|
|
||||||
reading_text = False
|
|
||||||
reading_entity = False
|
|
||||||
reading_mention = True
|
|
||||||
else:
|
|
||||||
reading_special_case = True
|
|
||||||
else:
|
|
||||||
if reading_entity:
|
|
||||||
entity_buffer += letter
|
|
||||||
elif reading_mention:
|
|
||||||
mention_buffer += letter
|
|
||||||
elif reading_text:
|
|
||||||
final_text += letter
|
|
||||||
else:
|
|
||||||
raise ValueError("Not sure at point", clean_text[index - 2 : index + 2])
|
|
||||||
|
|
||||||
if open_read > 2:
|
|
||||||
reading_special_case = True
|
|
||||||
|
|
||||||
if open_read == 2 and reading_text:
|
|
||||||
reading_text = False
|
|
||||||
reading_entity = True
|
|
||||||
reading_mention = False
|
|
||||||
|
|
||||||
# we just finished reading an entity
|
|
||||||
if open_read == 0 and not reading_text:
|
|
||||||
if "#" in entity_buffer or entity_buffer.startswith(":"):
|
|
||||||
reading_special_case = True
|
|
||||||
# Ignore cases with nested structures like File: handles etc
|
|
||||||
if not reading_special_case:
|
|
||||||
if not mention_buffer:
|
|
||||||
mention_buffer = entity_buffer
|
|
||||||
start = len(final_text)
|
|
||||||
end = start + len(mention_buffer)
|
|
||||||
qid = wp_to_id.get(entity_buffer, None)
|
|
||||||
if qid:
|
|
||||||
entities.append((mention_buffer, qid, start, end))
|
|
||||||
final_text += mention_buffer
|
|
||||||
|
|
||||||
entity_buffer = ""
|
|
||||||
mention_buffer = ""
|
|
||||||
|
|
||||||
reading_text = True
|
|
||||||
reading_entity = False
|
|
||||||
reading_mention = False
|
|
||||||
reading_special_case = False
|
|
||||||
return final_text, entities
|
|
||||||
|
|
||||||
|
|
||||||
def _write_training_description(outputfile, qid, description):
|
|
||||||
if description is not None:
|
|
||||||
line = str(qid) + "|" + description + "\n"
|
|
||||||
outputfile.write(line)
|
|
||||||
|
|
||||||
|
|
||||||
def _write_training_entities(outputfile, article_id, clean_text, entities):
|
|
||||||
entities_data = [
|
|
||||||
{"alias": ent[0], "entity": ent[1], "start": ent[2], "end": ent[3]}
|
|
||||||
for ent in entities
|
|
||||||
]
|
|
||||||
line = (
|
|
||||||
json.dumps(
|
|
||||||
{
|
|
||||||
"article_id": article_id,
|
|
||||||
"clean_text": clean_text,
|
|
||||||
"entities": entities_data,
|
|
||||||
},
|
|
||||||
ensure_ascii=False,
|
|
||||||
)
|
|
||||||
+ "\n"
|
|
||||||
)
|
|
||||||
outputfile.write(line)
|
|
||||||
|
|
||||||
|
|
||||||
def read_training_indices(entity_file_path):
|
|
||||||
""" This method creates two lists of indices into the training file: one with indices for the
|
|
||||||
training examples, and one for the dev examples."""
|
|
||||||
train_indices = []
|
|
||||||
dev_indices = []
|
|
||||||
|
|
||||||
with entity_file_path.open("r", encoding="utf8") as file:
|
|
||||||
for i, line in enumerate(file):
|
|
||||||
example = json.loads(line)
|
|
||||||
article_id = example["article_id"]
|
|
||||||
clean_text = example["clean_text"]
|
|
||||||
|
|
||||||
if is_valid_article(clean_text):
|
|
||||||
if is_dev(article_id):
|
|
||||||
dev_indices.append(i)
|
|
||||||
else:
|
|
||||||
train_indices.append(i)
|
|
||||||
|
|
||||||
return train_indices, dev_indices
|
|
||||||
|
|
||||||
|
|
||||||
def read_el_docs_golds(nlp, entity_file_path, dev, line_ids, kb, labels_discard=None):
|
|
||||||
""" This method provides training/dev examples that correspond to the entity annotations found by the nlp object.
|
|
||||||
For training, it will include both positive and negative examples by using the candidate generator from the kb.
|
|
||||||
For testing (kb=None), it will include all positive examples only."""
|
|
||||||
if not labels_discard:
|
|
||||||
labels_discard = []
|
|
||||||
|
|
||||||
texts = []
|
|
||||||
entities_list = []
|
|
||||||
|
|
||||||
with entity_file_path.open("r", encoding="utf8") as file:
|
|
||||||
for i, line in enumerate(file):
|
|
||||||
if i in line_ids:
|
|
||||||
example = json.loads(line)
|
|
||||||
article_id = example["article_id"]
|
|
||||||
clean_text = example["clean_text"]
|
|
||||||
entities = example["entities"]
|
|
||||||
|
|
||||||
if dev != is_dev(article_id) or not is_valid_article(clean_text):
|
|
||||||
continue
|
|
||||||
|
|
||||||
texts.append(clean_text)
|
|
||||||
entities_list.append(entities)
|
|
||||||
|
|
||||||
docs = nlp.pipe(texts, batch_size=50)
|
|
||||||
|
|
||||||
for doc, entities in zip(docs, entities_list):
|
|
||||||
gold = _get_gold_parse(doc, entities, dev=dev, kb=kb, labels_discard=labels_discard)
|
|
||||||
if gold and len(gold.links) > 0:
|
|
||||||
yield doc, gold
|
|
||||||
|
|
||||||
|
|
||||||
def _get_gold_parse(doc, entities, dev, kb, labels_discard):
|
|
||||||
gold_entities = {}
|
|
||||||
tagged_ent_positions = {
|
|
||||||
(ent.start_char, ent.end_char): ent
|
|
||||||
for ent in doc.ents
|
|
||||||
if ent.label_ not in labels_discard
|
|
||||||
}
|
|
||||||
|
|
||||||
for entity in entities:
|
|
||||||
entity_id = entity["entity"]
|
|
||||||
alias = entity["alias"]
|
|
||||||
start = entity["start"]
|
|
||||||
end = entity["end"]
|
|
||||||
|
|
||||||
candidate_ids = []
|
|
||||||
if kb and not dev:
|
|
||||||
candidates = kb.get_candidates(alias)
|
|
||||||
candidate_ids = [cand.entity_ for cand in candidates]
|
|
||||||
|
|
||||||
tagged_ent = tagged_ent_positions.get((start, end), None)
|
|
||||||
if tagged_ent:
|
|
||||||
# TODO: check that alias == doc.text[start:end]
|
|
||||||
should_add_ent = (dev or entity_id in candidate_ids) and is_valid_sentence(
|
|
||||||
tagged_ent.sent.text
|
|
||||||
)
|
|
||||||
|
|
||||||
if should_add_ent:
|
|
||||||
value_by_id = {entity_id: 1.0}
|
|
||||||
if not dev:
|
|
||||||
random.shuffle(candidate_ids)
|
|
||||||
value_by_id.update(
|
|
||||||
{kb_id: 0.0 for kb_id in candidate_ids if kb_id != entity_id}
|
|
||||||
)
|
|
||||||
gold_entities[(start, end)] = value_by_id
|
|
||||||
|
|
||||||
return GoldParse(doc, links=gold_entities)
|
|
||||||
|
|
||||||
|
|
||||||
def is_dev(article_id):
|
|
||||||
if not article_id:
|
|
||||||
return False
|
|
||||||
return article_id.endswith("3")
|
|
||||||
|
|
||||||
|
|
||||||
def is_valid_article(doc_text):
|
|
||||||
# custom length cut-off
|
|
||||||
return 10 < len(doc_text) < 30000
|
|
||||||
|
|
||||||
|
|
||||||
def is_valid_sentence(sent_text):
|
|
||||||
if not 10 < len(sent_text) < 3000:
|
|
||||||
# custom length cut-off
|
|
||||||
return False
|
|
||||||
|
|
||||||
if sent_text.strip().startswith("*") or sent_text.strip().startswith("#"):
|
|
||||||
# remove 'enumeration' sentences (occurs often on Wikipedia)
|
|
||||||
return False
|
|
||||||
|
|
||||||
return True
|
|
|
@ -129,10 +129,7 @@ def train_textcat(nlp, n_texts, n_iter=10):
|
||||||
)
|
)
|
||||||
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
with nlp.select_pipes(enable="textcat"): # only train textcat
|
||||||
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
|
||||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
textcat.model.get_ref("tok2vec").from_bytes(tok2vec_weights)
|
textcat.model.get_ref("tok2vec").from_bytes(tok2vec_weights)
|
||||||
print("Training the model...")
|
print("Training the model...")
|
||||||
|
|
|
@ -62,11 +62,8 @@ def main(model_name, unlabelled_loc):
|
||||||
optimizer.b1 = 0.0
|
optimizer.b1 = 0.0
|
||||||
optimizer.b2 = 0.0
|
optimizer.b2 = 0.0
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
|
||||||
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
|
||||||
sizes = compounding(1.0, 4.0, 1.001)
|
sizes = compounding(1.0, 4.0, 1.001)
|
||||||
with nlp.disable_pipes(*other_pipes):
|
with nlp.select_pipes(enable="ner"):
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
random.shuffle(TRAIN_DATA)
|
random.shuffle(TRAIN_DATA)
|
||||||
random.shuffle(raw_docs)
|
random.shuffle(raw_docs)
|
||||||
|
|
|
@ -5,16 +5,17 @@ from spacy.gold import docs_to_json
|
||||||
import srsly
|
import srsly
|
||||||
import sys
|
import sys
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
model=("Model name. Defaults to 'en'.", "option", "m", str),
|
model=("Model name. Defaults to 'en'.", "option", "m", str),
|
||||||
input_file=("Input file (jsonl)", "positional", None, Path),
|
input_file=("Input file (jsonl)", "positional", None, Path),
|
||||||
output_dir=("Output directory", "positional", None, Path),
|
output_dir=("Output directory", "positional", None, Path),
|
||||||
n_texts=("Number of texts to convert", "option", "t", int),
|
n_texts=("Number of texts to convert", "option", "t", int),
|
||||||
)
|
)
|
||||||
def convert(model='en', input_file=None, output_dir=None, n_texts=0):
|
def convert(model="en", input_file=None, output_dir=None, n_texts=0):
|
||||||
# Load model with tokenizer + sentencizer only
|
# Load model with tokenizer + sentencizer only
|
||||||
nlp = spacy.load(model)
|
nlp = spacy.load(model)
|
||||||
nlp.disable_pipes(*nlp.pipe_names)
|
nlp.select_pipes(disable=nlp.pipe_names)
|
||||||
sentencizer = nlp.create_pipe("sentencizer")
|
sentencizer = nlp.create_pipe("sentencizer")
|
||||||
nlp.add_pipe(sentencizer, first=True)
|
nlp.add_pipe(sentencizer, first=True)
|
||||||
|
|
||||||
|
@ -49,5 +50,6 @@ def convert(model='en', input_file=None, output_dir=None, n_texts=0):
|
||||||
|
|
||||||
srsly.write_json(output_dir / input_file.with_suffix(".json"), [docs_to_json(docs)])
|
srsly.write_json(output_dir / input_file.with_suffix(".json"), [docs_to_json(docs)])
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
plac.call(convert)
|
plac.call(convert)
|
||||||
|
|
|
@ -97,7 +97,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
||||||
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
|
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
|
||||||
TRAIN_DOCS = []
|
TRAIN_DOCS = []
|
||||||
for text, annotation in TRAIN_DATA:
|
for text, annotation in TRAIN_DATA:
|
||||||
with nlp.disable_pipes("entity_linker"):
|
with nlp.select_pipes(disable="entity_linker"):
|
||||||
doc = nlp(text)
|
doc = nlp(text)
|
||||||
annotation_clean = annotation
|
annotation_clean = annotation
|
||||||
for offset, kb_id_dict in annotation["links"].items():
|
for offset, kb_id_dict in annotation["links"].items():
|
||||||
|
@ -112,10 +112,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
||||||
annotation_clean["links"][offset] = new_dict
|
annotation_clean["links"][offset] = new_dict
|
||||||
TRAIN_DOCS.append((doc, annotation_clean))
|
TRAIN_DOCS.append((doc, annotation_clean))
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
with nlp.select_pipes(enable="entity_linker"): # only train entity linker
|
||||||
pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
|
||||||
with nlp.disable_pipes(*other_pipes): # only train entity linker
|
|
||||||
# reset and initialize the weights randomly
|
# reset and initialize the weights randomly
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
|
|
||||||
|
|
|
@ -124,9 +124,7 @@ def main(model=None, output_dir=None, n_iter=15):
|
||||||
for dep in annotations.get("deps", []):
|
for dep in annotations.get("deps", []):
|
||||||
parser.add_label(dep)
|
parser.add_label(dep)
|
||||||
|
|
||||||
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
|
with nlp.select_pipes(enable="parser"): # only train parser
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
|
||||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
random.shuffle(TRAIN_DATA)
|
random.shuffle(TRAIN_DATA)
|
||||||
|
|
|
@ -55,10 +55,7 @@ def main(model=None, output_dir=None, n_iter=100):
|
||||||
print("Add label", ent[2])
|
print("Add label", ent[2])
|
||||||
ner.add_label(ent[2])
|
ner.add_label(ent[2])
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
with nlp.select_pipes(enable="ner"): # only train NER
|
||||||
pipe_exceptions = ["simple_ner"]
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
|
||||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
|
||||||
# reset and initialize the weights randomly – but only if we're
|
# reset and initialize the weights randomly – but only if we're
|
||||||
# training a new model
|
# training a new model
|
||||||
if model is None:
|
if model is None:
|
||||||
|
|
|
@ -94,10 +94,8 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
|
||||||
else:
|
else:
|
||||||
optimizer = nlp.resume_training()
|
optimizer = nlp.resume_training()
|
||||||
move_names = list(ner.move_names)
|
move_names = list(ner.move_names)
|
||||||
# get names of other pipes to disable them during training
|
|
||||||
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
with nlp.select_pipes(enable="ner"): # only train NER
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
|
||||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
|
||||||
sizes = compounding(1.0, 4.0, 1.001)
|
sizes = compounding(1.0, 4.0, 1.001)
|
||||||
# batch up the examples using spaCy's minibatch
|
# batch up the examples using spaCy's minibatch
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
|
|
|
@ -64,10 +64,7 @@ def main(model=None, output_dir=None, n_iter=15):
|
||||||
for dep in annotations.get("deps", []):
|
for dep in annotations.get("deps", []):
|
||||||
parser.add_label(dep)
|
parser.add_label(dep)
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
with nlp.select_pipes(enable="parser"): # only train parser
|
||||||
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
|
||||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for itn in range(n_iter):
|
for itn in range(n_iter):
|
||||||
random.shuffle(TRAIN_DATA)
|
random.shuffle(TRAIN_DATA)
|
||||||
|
|
|
@ -68,10 +68,7 @@ def main(config_path, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=Non
|
||||||
ex = Example.from_gold(gold, doc=doc)
|
ex = Example.from_gold(gold, doc=doc)
|
||||||
train_examples.append(ex)
|
train_examples.append(ex)
|
||||||
|
|
||||||
# get names of other pipes to disable them during training
|
with nlp.select_pipes(enable="textcat"): # only train textcat
|
||||||
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
|
||||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
if init_tok2vec is not None:
|
if init_tok2vec is not None:
|
||||||
with init_tok2vec.open("rb") as file_:
|
with init_tok2vec.open("rb") as file_:
|
||||||
|
|
|
@ -145,7 +145,7 @@ def train(
|
||||||
msg.text(f"Loading vectors from model '{vectors}'")
|
msg.text(f"Loading vectors from model '{vectors}'")
|
||||||
_load_vectors(nlp, vectors)
|
_load_vectors(nlp, vectors)
|
||||||
|
|
||||||
nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
|
nlp.select_pipes(disable=[p for p in nlp.pipe_names if p not in pipeline])
|
||||||
for pipe in pipeline:
|
for pipe in pipeline:
|
||||||
# first, create the model.
|
# first, create the model.
|
||||||
# Bit of a hack after the refactor to get the vectors into a default config
|
# Bit of a hack after the refactor to get the vectors into a default config
|
||||||
|
@ -201,8 +201,8 @@ def train(
|
||||||
exits=1,
|
exits=1,
|
||||||
)
|
)
|
||||||
msg.text(f"Extending component from base model '{pipe}'")
|
msg.text(f"Extending component from base model '{pipe}'")
|
||||||
disabled_pipes = nlp.disable_pipes(
|
disabled_pipes = nlp.select_pipes(
|
||||||
[p for p in nlp.pipe_names if p not in pipeline]
|
disable=[p for p in nlp.pipe_names if p not in pipeline]
|
||||||
)
|
)
|
||||||
else:
|
else:
|
||||||
msg.text(f"Starting with blank model '{lang}'")
|
msg.text(f"Starting with blank model '{lang}'")
|
||||||
|
|
|
@ -104,6 +104,8 @@ class Warnings(object):
|
||||||
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
||||||
|
|
||||||
# TODO: fix numbering after merging develop into master
|
# TODO: fix numbering after merging develop into master
|
||||||
|
W096 = ("The method 'disable_pipes' has become deprecated - use 'select_pipes' "
|
||||||
|
"instead.")
|
||||||
W097 = ("No Model config was provided to create the '{name}' component, "
|
W097 = ("No Model config was provided to create the '{name}' component, "
|
||||||
"and no default configuration could be found either.")
|
"and no default configuration could be found either.")
|
||||||
W098 = ("No Model config was provided to create the '{name}' component, "
|
W098 = ("No Model config was provided to create the '{name}' component, "
|
||||||
|
@ -132,7 +134,7 @@ class Errors(object):
|
||||||
E007 = ("'{name}' already exists in pipeline. Existing names: {opts}")
|
E007 = ("'{name}' already exists in pipeline. Existing names: {opts}")
|
||||||
E008 = ("Some current components would be lost when restoring previous "
|
E008 = ("Some current components would be lost when restoring previous "
|
||||||
"pipeline state. If you added components after calling "
|
"pipeline state. If you added components after calling "
|
||||||
"`nlp.disable_pipes()`, you should remove them explicitly with "
|
"`nlp.select_pipes()`, you should remove them explicitly with "
|
||||||
"`nlp.remove_pipe()` before the pipeline is restored. Names of "
|
"`nlp.remove_pipe()` before the pipeline is restored. Names of "
|
||||||
"the new components: {names}")
|
"the new components: {names}")
|
||||||
E009 = ("The `update` method expects same number of docs and golds, but "
|
E009 = ("The `update` method expects same number of docs and golds, but "
|
||||||
|
@ -546,6 +548,13 @@ class Errors(object):
|
||||||
"token itself.")
|
"token itself.")
|
||||||
|
|
||||||
# TODO: fix numbering after merging develop into master
|
# TODO: fix numbering after merging develop into master
|
||||||
|
E991 = ("The function 'select_pipes' should be called with either a "
|
||||||
|
"'disable' argument to list the names of the pipe components "
|
||||||
|
"that should be disabled, or with an 'enable' argument that "
|
||||||
|
"specifies which pipes should not be disabled.")
|
||||||
|
E992 = ("The function `select_pipes` was called with `enable`={enable} "
|
||||||
|
"and `disable`={disable} but that information is conflicting "
|
||||||
|
"for the `nlp` pipeline with components {names}.")
|
||||||
E993 = ("The config for 'nlp' should include either a key 'name' to "
|
E993 = ("The config for 'nlp' should include either a key 'name' to "
|
||||||
"refer to an existing model by name or path, or a key 'lang' "
|
"refer to an existing model by name or path, or a key 'lang' "
|
||||||
"to create a new blank model.")
|
"to create a new blank model.")
|
||||||
|
|
|
@ -511,11 +511,37 @@ class Language(object):
|
||||||
of the block. Otherwise, a DisabledPipes object is returned, that has
|
of the block. Otherwise, a DisabledPipes object is returned, that has
|
||||||
a `.restore()` method you can use to undo your changes.
|
a `.restore()` method you can use to undo your changes.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/language#disable_pipes
|
This method has been deprecated since 3.0
|
||||||
"""
|
"""
|
||||||
|
warnings.warn(Warnings.W096, DeprecationWarning)
|
||||||
if len(names) == 1 and isinstance(names[0], (list, tuple)):
|
if len(names) == 1 and isinstance(names[0], (list, tuple)):
|
||||||
names = names[0] # support list of names instead of spread
|
names = names[0] # support list of names instead of spread
|
||||||
return DisabledPipes(self, *names)
|
return DisabledPipes(self, names)
|
||||||
|
|
||||||
|
def select_pipes(self, disable=None, enable=None):
|
||||||
|
"""Disable one or more pipeline components. If used as a context
|
||||||
|
manager, the pipeline will be restored to the initial state at the end
|
||||||
|
of the block. Otherwise, a DisabledPipes object is returned, that has
|
||||||
|
a `.restore()` method you can use to undo your changes.
|
||||||
|
|
||||||
|
disable (str or iterable): The name(s) of the pipes to disable
|
||||||
|
enable (str or iterable): The name(s) of the pipes to enable - all others will be disabled
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#select_pipes
|
||||||
|
"""
|
||||||
|
if enable is None and disable is None:
|
||||||
|
raise ValueError(Errors.E991)
|
||||||
|
if disable is not None and isinstance(disable, str):
|
||||||
|
disable = [disable]
|
||||||
|
if enable is not None:
|
||||||
|
if isinstance(enable, str):
|
||||||
|
enable = [enable]
|
||||||
|
to_disable = [pipe for pipe in self.pipe_names if pipe not in enable]
|
||||||
|
# raise an error if the enable and disable keywords are not consistent
|
||||||
|
if disable is not None and disable != to_disable:
|
||||||
|
raise ValueError(Errors.E992.format(enable=enable, disable=disable, names=self.pipe_names))
|
||||||
|
disable = to_disable
|
||||||
|
return DisabledPipes(self, disable)
|
||||||
|
|
||||||
def make_doc(self, text):
|
def make_doc(self, text):
|
||||||
return self.tokenizer(text)
|
return self.tokenizer(text)
|
||||||
|
@ -1117,7 +1143,7 @@ def _fix_pretrained_vectors_name(nlp):
|
||||||
class DisabledPipes(list):
|
class DisabledPipes(list):
|
||||||
"""Manager for temporary pipeline disabling."""
|
"""Manager for temporary pipeline disabling."""
|
||||||
|
|
||||||
def __init__(self, nlp, *names):
|
def __init__(self, nlp, names):
|
||||||
self.nlp = nlp
|
self.nlp = nlp
|
||||||
self.names = names
|
self.names = names
|
||||||
# Important! Not deep copy -- we just want the container (but we also
|
# Important! Not deep copy -- we just want the container (but we also
|
||||||
|
|
|
@ -200,7 +200,7 @@ class EntityRuler(object):
|
||||||
]
|
]
|
||||||
except ValueError:
|
except ValueError:
|
||||||
subsequent_pipes = []
|
subsequent_pipes = []
|
||||||
with self.nlp.disable_pipes(subsequent_pipes):
|
with self.nlp.select_pipes(disable=subsequent_pipes):
|
||||||
token_patterns = []
|
token_patterns = []
|
||||||
phrase_pattern_labels = []
|
phrase_pattern_labels = []
|
||||||
phrase_pattern_texts = []
|
phrase_pattern_texts = []
|
||||||
|
|
|
@ -88,7 +88,16 @@ def test_remove_pipe(nlp, name):
|
||||||
def test_disable_pipes_method(nlp, name):
|
def test_disable_pipes_method(nlp, name):
|
||||||
nlp.add_pipe(new_pipe, name=name)
|
nlp.add_pipe(new_pipe, name=name)
|
||||||
assert nlp.has_pipe(name)
|
assert nlp.has_pipe(name)
|
||||||
disabled = nlp.disable_pipes(name)
|
disabled = nlp.select_pipes(disable=name)
|
||||||
|
assert not nlp.has_pipe(name)
|
||||||
|
disabled.restore()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("name", ["my_component"])
|
||||||
|
def test_enable_pipes_method(nlp, name):
|
||||||
|
nlp.add_pipe(new_pipe, name=name)
|
||||||
|
assert nlp.has_pipe(name)
|
||||||
|
disabled = nlp.select_pipes(enable=[])
|
||||||
assert not nlp.has_pipe(name)
|
assert not nlp.has_pipe(name)
|
||||||
disabled.restore()
|
disabled.restore()
|
||||||
|
|
||||||
|
@ -97,19 +106,57 @@ def test_disable_pipes_method(nlp, name):
|
||||||
def test_disable_pipes_context(nlp, name):
|
def test_disable_pipes_context(nlp, name):
|
||||||
nlp.add_pipe(new_pipe, name=name)
|
nlp.add_pipe(new_pipe, name=name)
|
||||||
assert nlp.has_pipe(name)
|
assert nlp.has_pipe(name)
|
||||||
with nlp.disable_pipes(name):
|
with nlp.select_pipes(disable=name):
|
||||||
assert not nlp.has_pipe(name)
|
assert not nlp.has_pipe(name)
|
||||||
assert nlp.has_pipe(name)
|
assert nlp.has_pipe(name)
|
||||||
|
|
||||||
|
|
||||||
def test_disable_pipes_list_arg(nlp):
|
def test_select_pipes_list_arg(nlp):
|
||||||
for name in ["c1", "c2", "c3"]:
|
for name in ["c1", "c2", "c3"]:
|
||||||
nlp.add_pipe(new_pipe, name=name)
|
nlp.add_pipe(new_pipe, name=name)
|
||||||
assert nlp.has_pipe(name)
|
assert nlp.has_pipe(name)
|
||||||
with nlp.disable_pipes(["c1", "c2"]):
|
with nlp.select_pipes(disable=["c1", "c2"]):
|
||||||
assert not nlp.has_pipe("c1")
|
assert not nlp.has_pipe("c1")
|
||||||
assert not nlp.has_pipe("c2")
|
assert not nlp.has_pipe("c2")
|
||||||
assert nlp.has_pipe("c3")
|
assert nlp.has_pipe("c3")
|
||||||
|
with nlp.select_pipes(enable="c3"):
|
||||||
|
assert not nlp.has_pipe("c1")
|
||||||
|
assert not nlp.has_pipe("c2")
|
||||||
|
assert nlp.has_pipe("c3")
|
||||||
|
with nlp.select_pipes(enable=["c1", "c2"], disable="c3"):
|
||||||
|
assert nlp.has_pipe("c1")
|
||||||
|
assert nlp.has_pipe("c2")
|
||||||
|
assert not nlp.has_pipe("c3")
|
||||||
|
with nlp.select_pipes(enable=[]):
|
||||||
|
assert not nlp.has_pipe("c1")
|
||||||
|
assert not nlp.has_pipe("c2")
|
||||||
|
assert not nlp.has_pipe("c3")
|
||||||
|
with nlp.select_pipes(enable=["c1", "c2", "c3"], disable=[]):
|
||||||
|
assert nlp.has_pipe("c1")
|
||||||
|
assert nlp.has_pipe("c2")
|
||||||
|
assert nlp.has_pipe("c3")
|
||||||
|
with nlp.select_pipes(disable=["c1", "c2", "c3"], enable=[]):
|
||||||
|
assert not nlp.has_pipe("c1")
|
||||||
|
assert not nlp.has_pipe("c2")
|
||||||
|
assert not nlp.has_pipe("c3")
|
||||||
|
|
||||||
|
|
||||||
|
def test_select_pipes_errors(nlp):
|
||||||
|
for name in ["c1", "c2", "c3"]:
|
||||||
|
nlp.add_pipe(new_pipe, name=name)
|
||||||
|
assert nlp.has_pipe(name)
|
||||||
|
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
nlp.select_pipes()
|
||||||
|
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
nlp.select_pipes(enable=["c1", "c2"], disable=["c1"])
|
||||||
|
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
nlp.select_pipes(enable=["c1", "c2"], disable=[])
|
||||||
|
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
nlp.select_pipes(enable=[], disable=["c3"])
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("n_pipes", [100])
|
@pytest.mark.parametrize("n_pipes", [100])
|
||||||
|
|
|
@ -31,7 +31,7 @@ def test_issue3611():
|
||||||
nlp.add_pipe(textcat, last=True)
|
nlp.add_pipe(textcat, last=True)
|
||||||
|
|
||||||
# training the network
|
# training the network
|
||||||
with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
|
with nlp.select_pipes(enable="textcat"):
|
||||||
optimizer = nlp.begin_training(X=x_train, Y=y_train)
|
optimizer = nlp.begin_training(X=x_train, Y=y_train)
|
||||||
for i in range(3):
|
for i in range(3):
|
||||||
losses = {}
|
losses = {}
|
||||||
|
|
|
@ -31,7 +31,7 @@ def test_issue4030():
|
||||||
nlp.add_pipe(textcat, last=True)
|
nlp.add_pipe(textcat, last=True)
|
||||||
|
|
||||||
# training the network
|
# training the network
|
||||||
with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
|
with nlp.select_pipes(enable="textcat"):
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for i in range(3):
|
for i in range(3):
|
||||||
losses = {}
|
losses = {}
|
||||||
|
|
|
@ -314,45 +314,47 @@ component function.
|
||||||
| `name` | unicode | Name of the component to remove. |
|
| `name` | unicode | Name of the component to remove. |
|
||||||
| **RETURNS** | tuple | A `(name, component)` tuple of the removed component. |
|
| **RETURNS** | tuple | A `(name, component)` tuple of the removed component. |
|
||||||
|
|
||||||
## Language.disable_pipes {#disable_pipes tag="contextmanager, method" new="2"}
|
## Language.select_pipes {#select_pipes tag="contextmanager, method" new="3"}
|
||||||
|
|
||||||
Disable one or more pipeline components. If used as a context manager, the
|
Disable one or more pipeline components. If used as a context manager, the
|
||||||
pipeline will be restored to the initial state at the end of the block.
|
pipeline will be restored to the initial state at the end of the block.
|
||||||
Otherwise, a `DisabledPipes` object is returned, that has a `.restore()` method
|
Otherwise, a `DisabledPipes` object is returned, that has a `.restore()` method
|
||||||
you can use to undo your changes.
|
you can use to undo your changes.
|
||||||
|
|
||||||
|
You can specify either `disable` (as a list or string), or `enable`. In the
|
||||||
|
latter case, all components not in the `enable` list, will be disabled.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> # New API as of v2.2.2
|
> # New API as of v3.0
|
||||||
> with nlp.disable_pipes(["tagger", "parser"]):
|
> with nlp.select_pipes(disable=["tagger", "parser"]):
|
||||||
> nlp.begin_training()
|
> nlp.begin_training()
|
||||||
>
|
>
|
||||||
> with nlp.disable_pipes("tagger", "parser"):
|
> with nlp.select_pipes(enable="ner"):
|
||||||
> nlp.begin_training()
|
> nlp.begin_training()
|
||||||
>
|
>
|
||||||
> disabled = nlp.disable_pipes("tagger", "parser")
|
> disabled = nlp.select_pipes(disable=["tagger", "parser"])
|
||||||
> nlp.begin_training()
|
> nlp.begin_training()
|
||||||
> disabled.restore()
|
> disabled.restore()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------------------------------------- | --------------- | ------------------------------------------------------------------------------------ |
|
| ----------- | --------------- | ------------------------------------------------------------------------------------ |
|
||||||
| `disabled` <Tag variant="new">2.2.2</Tag> | list | Names of pipeline components to disable. |
|
| `disable` | list | Names of pipeline components to disable. |
|
||||||
| `*disabled` | unicode | Names of pipeline components to disable. |
|
| `disable` | unicode | Name of pipeline component to disable. |
|
||||||
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
|
| `enable` | list | Names of pipeline components that will not be disabled. |
|
||||||
|
| `enable` | unicode | Name of pipeline component that will not be disabled. |
|
||||||
|
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
|
||||||
|
|
||||||
<Infobox title="Changed in v2.2.2" variant="warning">
|
|
||||||
|
|
||||||
As of spaCy v2.2.2, the `Language.disable_pipes` method can also take a list of
|
<Infobox title="Changed in v3.0" variant="warning">
|
||||||
component names as its first argument (instead of a variable number of
|
|
||||||
arguments). This is especially useful if you're generating the component names
|
As of spaCy v3.0, the `disable_pipes` method has been renamed to `select_pipes`:
|
||||||
to disable programmatically. The new syntax will become the default in the
|
|
||||||
future.
|
|
||||||
|
|
||||||
```diff
|
```diff
|
||||||
- disabled = nlp.disable_pipes("tagger", "parser")
|
- nlp.disable_pipes(["tagger", "parser"])
|
||||||
+ disabled = nlp.disable_pipes(["tagger", "parser"])
|
+ nlp.select_pipes(disable=["tagger", "parser"])
|
||||||
```
|
```
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
|
@ -252,9 +252,9 @@ for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
|
||||||
|
|
||||||
If you need to **execute more code** with components disabled – e.g. to reset
|
If you need to **execute more code** with components disabled – e.g. to reset
|
||||||
the weights or update only some components during training – you can use the
|
the weights or update only some components during training – you can use the
|
||||||
[`nlp.disable_pipes`](/api/language#disable_pipes) contextmanager. At the end of
|
[`nlp.select_pipes`](/api/language#select_pipes) contextmanager. At the end of
|
||||||
the `with` block, the disabled pipeline components will be restored
|
the `with` block, the disabled pipeline components will be restored
|
||||||
automatically. Alternatively, `disable_pipes` returns an object that lets you
|
automatically. Alternatively, `select_pipes` returns an object that lets you
|
||||||
call its `restore()` method to restore the disabled components when needed. This
|
call its `restore()` method to restore the disabled components when needed. This
|
||||||
can be useful if you want to prevent unnecessary code indentation of large
|
can be useful if you want to prevent unnecessary code indentation of large
|
||||||
blocks.
|
blocks.
|
||||||
|
@ -262,16 +262,26 @@ blocks.
|
||||||
```python
|
```python
|
||||||
### Disable for block
|
### Disable for block
|
||||||
# 1. Use as a contextmanager
|
# 1. Use as a contextmanager
|
||||||
with nlp.disable_pipes("tagger", "parser"):
|
with nlp.select_pipes(disable=["tagger", "parser"]):
|
||||||
doc = nlp("I won't be tagged and parsed")
|
doc = nlp("I won't be tagged and parsed")
|
||||||
doc = nlp("I will be tagged and parsed")
|
doc = nlp("I will be tagged and parsed")
|
||||||
|
|
||||||
# 2. Restore manually
|
# 2. Restore manually
|
||||||
disabled = nlp.disable_pipes("ner")
|
disabled = nlp.select_pipes(disable="ner")
|
||||||
doc = nlp("I won't have named entities")
|
doc = nlp("I won't have named entities")
|
||||||
disabled.restore()
|
disabled.restore()
|
||||||
```
|
```
|
||||||
|
|
||||||
|
If you want to disable all pipes except for one or a few, you can use the `enable`
|
||||||
|
keyword. Just like the `disable` keyword, it takes a list of pipe names, or a string
|
||||||
|
defining just one pipe.
|
||||||
|
```python
|
||||||
|
# Enable only the parser
|
||||||
|
with nlp.select_pipes(enable="parser"):
|
||||||
|
doc = nlp("I will only be parsed")
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
|
Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
|
||||||
to remove pipeline components from an existing pipeline, the
|
to remove pipeline components from an existing pipeline, the
|
||||||
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
|
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
|
||||||
|
|
|
@ -906,7 +906,7 @@ pipeline component, **make sure that the pipeline component runs** when you
|
||||||
create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
|
create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
|
||||||
objects need to have part-of-speech tags set by the `tagger`. You can either
|
objects need to have part-of-speech tags set by the `tagger`. You can either
|
||||||
call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use
|
call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use
|
||||||
[`nlp.disable_pipes`](/api/language#disable_pipes) to disable components
|
[`nlp.select_pipes`](/api/language#select_pipes) to disable components
|
||||||
selectively.
|
selectively.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
@ -1121,8 +1121,7 @@ while adding the phrase patterns.
|
||||||
entityruler = EntityRuler(nlp)
|
entityruler = EntityRuler(nlp)
|
||||||
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
|
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
|
||||||
|
|
||||||
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
|
with nlp.select_pipes(enable="tagger"):
|
||||||
with nlp.disable_pipes(*other_pipes):
|
|
||||||
entityruler.add_patterns(patterns)
|
entityruler.add_patterns(patterns)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
|
@ -647,8 +647,7 @@ import random
|
||||||
nlp = spacy.load("en_core_web_sm")
|
nlp = spacy.load("en_core_web_sm")
|
||||||
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]
|
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]
|
||||||
|
|
||||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
|
with nlp.select_pipes(enable="ner"):
|
||||||
with nlp.disable_pipes(*other_pipes):
|
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for i in range(10):
|
for i in range(10):
|
||||||
random.shuffle(train_data)
|
random.shuffle(train_data)
|
||||||
|
|
|
@ -362,7 +362,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py
|
||||||
you're using a blank model, don't forget to add the entity recognizer to the
|
you're using a blank model, don't forget to add the entity recognizer to the
|
||||||
pipeline. If you're using an existing model, make sure to disable all other
|
pipeline. If you're using an existing model, make sure to disable all other
|
||||||
pipeline components during training using
|
pipeline components during training using
|
||||||
[`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
|
[`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
|
||||||
training the entity recognizer.
|
training the entity recognizer.
|
||||||
2. **Shuffle and loop over** the examples. For each example, **update the
|
2. **Shuffle and loop over** the examples. For each example, **update the
|
||||||
model** by calling [`nlp.update`](/api/language#update), which steps through
|
model** by calling [`nlp.update`](/api/language#update), which steps through
|
||||||
|
@ -403,7 +403,7 @@ referred to as the "catastrophic forgetting" problem.
|
||||||
you're using a blank model, don't forget to add the entity recognizer to the
|
you're using a blank model, don't forget to add the entity recognizer to the
|
||||||
pipeline. If you're using an existing model, make sure to disable all other
|
pipeline. If you're using an existing model, make sure to disable all other
|
||||||
pipeline components during training using
|
pipeline components during training using
|
||||||
[`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
|
[`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
|
||||||
training the entity recognizer.
|
training the entity recognizer.
|
||||||
2. **Add the new entity label** to the entity recognizer using the
|
2. **Add the new entity label** to the entity recognizer using the
|
||||||
[`add_label`](/api/entityrecognizer#add_label) method. You can access the
|
[`add_label`](/api/entityrecognizer#add_label) method. You can access the
|
||||||
|
@ -436,7 +436,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_parser.py
|
||||||
you're using a blank model, don't forget to add the parser to the pipeline.
|
you're using a blank model, don't forget to add the parser to the pipeline.
|
||||||
If you're using an existing model, make sure to disable all other pipeline
|
If you're using an existing model, make sure to disable all other pipeline
|
||||||
components during training using
|
components during training using
|
||||||
[`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
|
[`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
|
||||||
training the parser.
|
training the parser.
|
||||||
2. **Add the dependency labels** to the parser using the
|
2. **Add the dependency labels** to the parser using the
|
||||||
[`add_label`](/api/dependencyparser#add_label) method. If you're starting off
|
[`add_label`](/api/dependencyparser#add_label) method. If you're starting off
|
||||||
|
@ -470,7 +470,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_tagger.py
|
||||||
you're using a blank model, don't forget to add the tagger to the pipeline.
|
you're using a blank model, don't forget to add the tagger to the pipeline.
|
||||||
If you're using an existing model, make sure to disable all other pipeline
|
If you're using an existing model, make sure to disable all other pipeline
|
||||||
components during training using
|
components during training using
|
||||||
[`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
|
[`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
|
||||||
training the tagger.
|
training the tagger.
|
||||||
2. **Add the tag map** to the tagger using the
|
2. **Add the tag map** to the tagger using the
|
||||||
[`add_label`](/api/tagger#add_label) method. The first argument is the new
|
[`add_label`](/api/tagger#add_label) method. The first argument is the new
|
||||||
|
@ -544,7 +544,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_intent_pa
|
||||||
you're using a blank model, don't forget to add the custom parser to the
|
you're using a blank model, don't forget to add the custom parser to the
|
||||||
pipeline. If you're using an existing model, make sure to **remove the old
|
pipeline. If you're using an existing model, make sure to **remove the old
|
||||||
parser** from the pipeline, and disable all other pipeline components during
|
parser** from the pipeline, and disable all other pipeline components during
|
||||||
training using [`nlp.disable_pipes`](/api/language#disable_pipes). This way,
|
training using [`nlp.select_pipes`](/api/language#select_pipes). This way,
|
||||||
you'll only be training the parser.
|
you'll only be training the parser.
|
||||||
3. **Add the dependency labels** to the parser using the
|
3. **Add the dependency labels** to the parser using the
|
||||||
[`add_label`](/api/dependencyparser#add_label) method.
|
[`add_label`](/api/dependencyparser#add_label) method.
|
||||||
|
@ -576,7 +576,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.p
|
||||||
[`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
|
[`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. If
|
||||||
you're using an existing model, make sure to disable all other pipeline
|
you're using an existing model, make sure to disable all other pipeline
|
||||||
components during training using
|
components during training using
|
||||||
[`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be
|
[`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be
|
||||||
training the text classifier.
|
training the text classifier.
|
||||||
2. **Add the text classifier** to the pipeline, and add the labels you want to
|
2. **Add the text classifier** to the pipeline, and add the labels you want to
|
||||||
train – for example, `POSITIVE`.
|
train – for example, `POSITIVE`.
|
||||||
|
@ -653,7 +653,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_li
|
||||||
pipeline including also a component for
|
pipeline including also a component for
|
||||||
[named entity recognition](/usage/training#ner). If you're using a model with
|
[named entity recognition](/usage/training#ner). If you're using a model with
|
||||||
additional components, make sure to disable all other pipeline components
|
additional components, make sure to disable all other pipeline components
|
||||||
during training using [`nlp.disable_pipes`](/api/language#disable_pipes).
|
during training using [`nlp.select_pipes`](/api/language#select_pipes).
|
||||||
This way, you'll only be training the entity linker.
|
This way, you'll only be training the entity linker.
|
||||||
2. **Shuffle and loop over** the examples. For each example, **update the
|
2. **Shuffle and loop over** the examples. For each example, **update the
|
||||||
model** by calling [`nlp.update`](/api/language#update), which steps through
|
model** by calling [`nlp.update`](/api/language#update), which steps through
|
||||||
|
|
Loading…
Reference in New Issue
Block a user