Entity Linking with Wikipedia and Wikidata

Step 1: Create a Knowledge Base (KB) and training data

Run wikidata_pretrain_kb.py

  • This takes as input the locations of a Wikipedia and a Wikidata dump, and produces a KB directory + training file
  • You can set the filtering parameters for KB construction:
    • max_per_alias: maximum number of candidate entities stored in the KB per alias/synonym
    • min_freq: minimum number of times an entity must occur in the corpus to be included in the KB
    • min_pair: minimum number of times an entity+alias combination must occur in the corpus to be included in the KB
  • Further parameters to set:
    • descriptions_from_wikipedia: whether to parse descriptions from Wikipedia (True) or Wikidata (False)
    • entity_vector_length: length of the pre-trained entity description vectors
    • lang: language for which to fetch Wikidata information (as the dump contains all languages)
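To make the interaction between the three filtering thresholds concrete, here is a small self-contained sketch. The function name and data shapes are hypothetical, not spaCy's actual implementation; it only illustrates how raw corpus counts could be filtered into per-alias candidate lists:

```python
from collections import defaultdict

# Hypothetical illustration of the KB filtering thresholds: entities below
# min_freq are dropped, alias->entity pairs below min_pair are dropped, and
# each alias keeps at most max_per_alias candidates (most frequent first).
def filter_kb_candidates(entity_freq, pair_freq,
                         max_per_alias=10, min_freq=20, min_pair=5):
    """entity_freq: {entity: count}; pair_freq: {(alias, entity): count}."""
    candidates = defaultdict(list)
    for (alias, entity), count in pair_freq.items():
        if entity_freq.get(entity, 0) >= min_freq and count >= min_pair:
            candidates[alias].append((entity, count))
    # keep only the max_per_alias most frequent candidates per alias
    return {
        alias: [e for e, _ in sorted(pairs, key=lambda p: -p[1])[:max_per_alias]]
        for alias, pairs in candidates.items()
    }
```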

Quick testing and rerunning:

  • When trying out the pipeline for a quick test, set limit_prior, limit_train and/or limit_wd to read only parts of the dumps instead of everything.
  • If you only want to (re)run certain parts of the pipeline, just remove the corresponding files and they will be recalculated or reparsed.
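For a quick smoke test, the limits can be passed on the command line. The sketch below is illustrative only: the dump paths are placeholders, and the exact argument order and flag spellings should be confirmed via the script's --help output:

```shell
# Illustrative invocation only -- paths are placeholders; check
# `python wikidata_pretrain_kb.py --help` for the real argument order.
python wikidata_pretrain_kb.py \
    path/to/wikidata_dump.json.bz2 \
    path/to/wikipedia_dump.xml.bz2 \
    path/to/kb_output_dir \
    --limit_prior 1000 --limit_train 1000 --limit_wd 1000
```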

Step 2: Train an Entity Linking model

Run wikidata_train_entity_linker.py

  • This takes the KB directory produced by Step 1, and trains an Entity Linking model
  • You can set the learning parameters for the EL training:
    • epochs: number of training iterations
    • dropout: dropout rate
    • lr: learning rate
    • l2: L2 regularization
  • Specify the number of training and development (dev) instances with train_inst and dev_inst, respectively
  • Further parameters to set:
    • labels_discard: NER label types to discard during training
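The hyperparameters above play their usual roles in gradient-based training. As a purely illustrative sketch (a toy one-weight linear model, not the actual spaCy EntityLinker training loop), this shows where epochs, dropout, lr, and l2 enter an SGD update:

```python
import random

# Toy illustration (hypothetical, not the script's code): one-feature linear
# model trained with SGD, showing where epochs, lr (learning rate) and l2
# (weight decay) enter the update rule. Dropout is emulated by randomly
# skipping training steps.
def sgd_train(data, epochs=50, dropout=0.0, lr=0.1, l2=0.0, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            if rng.random() < dropout:   # drop this input for the step
                continue
            pred = w * x
            grad = 2 * (pred - y) * x    # d/dw of squared error
            w -= lr * (grad + l2 * w)    # SGD step with L2 penalty
    return w
```

With l2 > 0 the learned weight is shrunk slightly toward zero, which is the regularizing effect the l2 parameter controls.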