spaCy/spacy
Sofie Van Landeghem 2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing comma's

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
cli KB extensions and better parsing of WikiData (#4375) 2019-10-14 12:28:53 +02:00
data
displacy Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
lang Initial commit: New language Luxembourgish (lb) (#4424) 2019-10-14 12:27:50 +02:00
matcher Fix PhraseMatcher.remove for overlapping patterns (#4437) 2019-10-14 12:19:51 +02:00
pipeline KB extensions and better parsing of WikiData (#4375) 2019-10-14 12:28:53 +02:00
syntax Ensure the NER remains consistent after resizing (#4330) 2019-09-27 20:57:13 +02:00
tests KB extensions and better parsing of WikiData (#4375) 2019-10-14 12:28:53 +02:00
tokens Bugfix initializing DocBin with attributes (#4368) 2019-10-03 14:48:45 +02:00
__init__.pxd
__init__.py Add registry for model creation functions ('architectures') (#4395) 2019-10-08 12:21:03 +02:00
__main__.py Update __main__.py 2019-03-20 09:43:26 +01:00
_align.pyx Improve alignment around quotes 2018-08-16 01:04:34 +02:00
_ml.py Improve spacy pretrain (#4393) 2019-10-07 23:34:58 +02:00
about.py Set version to v2.2.1 2019-10-03 14:50:39 +02:00
attrs.pxd Fix attrs alignment 2019-07-12 17:59:47 +02:00
attrs.pyx Bugfix initializing DocBin with attributes (#4368) 2019-10-03 14:48:45 +02:00
compat.py Improve usage of pkg_resources and handling of entry points (#4387) 2019-10-07 17:22:09 +02:00
errors.py KB extensions and better parsing of WikiData (#4375) 2019-10-14 12:28:53 +02:00
glossary.py Include Norwegian NER entity types in glossary [ci skip] 2019-09-15 17:16:21 +02:00
gold.pxd Merge changes from master 2019-08-21 14:18:52 +02:00
gold.pyx Fix orth replacement 2019-09-19 00:03:24 +02:00
kb.pxd rename entity frequency 2019-07-19 17:40:28 +02:00
kb.pyx KB extensions and better parsing of WikiData (#4375) 2019-10-14 12:28:53 +02:00
language.py KB extensions and better parsing of WikiData (#4375) 2019-10-14 12:28:53 +02:00
lemmatizer.py Refactor lemmatizer and data table integration (#4353) 2019-10-01 21:36:03 +02:00
lexeme.pxd 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325) 2019-02-24 21:13:51 +01:00
lexeme.pyx Alphanumeric -> alphabetic [ci skip] 2019-10-06 13:30:01 +02:00
lookups.py Refactor lemmatizer and data table integration (#4353) 2019-10-01 21:36:03 +02:00
morphology.pxd annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
morphology.pyx Improve Morphology errors (#4314) 2019-09-21 14:37:06 +02:00
parts_of_speech.pxd
parts_of_speech.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
scorer.py Make except more explicit 2019-09-18 19:57:08 +02:00
strings.pxd Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
strings.pyx Merge branch 'master' into feature/lemmatizer 2019-03-16 13:44:22 +01:00
structs.pxd Merge changes from master 2019-08-21 14:18:52 +02:00
symbols.pxd Fix symbol alignment 2019-07-12 17:48:38 +02:00
symbols.pyx ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
tokenizer.pxd Flush tokenizer cache when necessary (#4258) 2019-09-08 20:52:46 +02:00
tokenizer.pyx Improve URL_PATTERN and handling in tokenizer (#4374) 2019-10-05 13:00:09 +02:00
typedefs.pxd
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py Fix util.filter_spans() to prefer first span in overlapping sam… (#4414) 2019-10-10 17:00:03 +02:00
vectors.pyx Consider batch_size when sorting similar vectors (#4388) 2019-10-07 13:38:35 +02:00
vocab.pxd 💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167) 2019-08-22 14:21:32 +02:00
vocab.pyx most_similar() return the k most similar vectors (#4364) 2019-10-03 14:09:44 +02:00