spaCy

Mirror of https://github.com/explosion/spaCy.git, synced 2026-02-09 16:59:55 +03:00

Author	SHA1	Message	Date
svlandeg	d51bffe63b	clean up code	2019-05-16 18:36:15 +02:00
svlandeg	b5470f3d75	various tests, architectures and experiments	2019-05-16 18:25:34 +02:00
svlandeg	9ffe5437ae	calculate gradient for entity encoding	2019-05-15 02:23:08 +02:00
svlandeg	2713abc651	implement loss function using dot product and prob estimate per candidate cluster	2019-05-14 22:55:56 +02:00
svlandeg	09ed446b20	different architecture / settings	2019-05-14 08:37:52 +02:00
svlandeg	4142e8dd1b	train and predict per article (saving time for doc encoding)	2019-05-13 17:02:34 +02:00
svlandeg	3b81b00954	evaluating on dev set during training	2019-05-13 14:26:04 +02:00
svlandeg	b6d788064a	some first experiments with different architectures and metrics	2019-05-10 12:53:14 +02:00
svlandeg	9d089c0410	grouping clusters of instances per doc+mention	2019-05-09 18:11:49 +02:00
svlandeg	c6ca8649d7	first stab at model - not functional yet	2019-05-09 17:23:19 +02:00
svlandeg	9f33732b96	using entity descriptions and article texts as input embedding vectors for training	2019-05-07 16:03:42 +02:00
svlandeg	7e348d7f7f	baseline evaluation using highest-freq candidate	2019-05-06 15:13:50 +02:00
svlandeg	6961215578	refactor code to separate functionality into different files	2019-05-06 10:56:56 +02:00
svlandeg	f5190267e7	run only 100M of WP data as training dataset (9%)	2019-05-03 18:09:09 +02:00
svlandeg	4e929600e5	fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now)	2019-05-03 17:37:47 +02:00
svlandeg	34600c92bd	try catch per article to ensure the pipeline goes on	2019-05-03 15:10:09 +02:00
svlandeg	bbcb9da466	creating training data with clean WP texts and QID entities true/false	2019-05-03 10:44:29 +02:00
svlandeg	cba9680d13	run NER on clean WP text and link to gold-standard entity IDs	2019-05-02 17:24:52 +02:00
svlandeg	581dc9742d	parsing clean text from WP articles to use as input data for NER and NEL	2019-05-02 17:09:56 +02:00
svlandeg	8353552191	cleanup	2019-05-01 23:26:16 +02:00
svlandeg	1ae41daaa9	allow small rounding errors	2019-05-01 23:05:40 +02:00
svlandeg	3629a52ede	reading all persons in wikidata	2019-05-01 01:00:59 +02:00
svlandeg	60b54ae8ce	bulk entity writing and experiment with regex wikidata reader to speed up processing	2019-05-01 00:00:38 +02:00
svlandeg	653b7d9c87	calculate entity raw counts offline to speed up KB construction	2019-04-30 11:39:42 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	387263d618	simplify chains	2019-04-29 13:58:07 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
svlandeg	3e0cb69065	KB aliases to and from file	2019-04-24 20:24:24 +02:00
svlandeg	ad6c5e581c	writing and reading number of entries to/from header	2019-04-24 15:31:44 +02:00
svlandeg	6e3223f234	bulk loading in proper order of entity indices	2019-04-24 11:26:38 +02:00
svlandeg	694fea597a	dumping all entryC entries + (inefficient) reading back in	2019-04-23 18:36:50 +02:00
svlandeg	8e70a564f1	custom reader and writer for _EntryC fields (first stab at it - not complete)	2019-04-23 16:33:40 +02:00
svlandeg	004e5e7d1c	little fixes	2019-04-19 14:24:02 +02:00
svlandeg	9a8197185b	fix alias capitalization	2019-04-18 22:37:50 +02:00
svlandeg	9f308eb5dc	fixes for prior prob and linking wikidata IDs with wikipedia titles	2019-04-18 16:14:25 +02:00
svlandeg	10ee8dfea2	poc with few entities and collecting aliases from the WP links	2019-04-18 14:12:17 +02:00
svlandeg	6763e025e1	parse wp dump for links to determine prior probabilities	2019-04-15 11:41:57 +02:00
svlandeg	3163331b1e	wikipedia dump parser and mediawiki format regex cleanup	2019-04-14 21:52:01 +02:00
svlandeg	b31a390a9a	reading types, claims and sitelinks	2019-04-11 21:42:44 +02:00
svlandeg	6e997be4b4	reading wikidata descriptions and aliases	2019-04-11 21:08:22 +02:00
svlandeg	9a7d534b1b	enable nogil for cython functions in kb.pxd	2019-04-10 17:25:10 +02:00
svlandeg	61a33f55d2	little fixes	2019-04-10 16:06:09 +02:00
Ines Montani	6ae3b5699e	Make sure path is string (resolves #3546)	2019-04-08 12:53:41 +02:00
Ines Montani	d0f5e015cb	Auto-format	2019-04-08 12:53:16 +02:00
pierremonico	0d26bfe677	Removes duplicate in table (#3550) * Removes duplicate in table Just fixing typos. * Remove newline Co-authored-by: Ines Montani <ines@ines.io>	2019-04-08 10:30:42 +02:00
Piero Molino	5198aa4ae6	Added Ludwig among the projects (#3548) [ci skip] * Added Ludwig among the projects * Create w4nderlust.md * Add Uber to logo wall	2019-04-07 13:01:26 +02:00
Dobita21	8bf6967eb7	Update Thai stop words (#3545) * test sPacy commit to git fri 04052019 10:54 * change Data format from my format to master format * ทัทั้งนี้ ---> ทั้งนี้ * delete stop_word translate from Eng * Adjust formatting and readability	2019-04-05 12:06:38 +02:00
jeannefukumaru	f67d881b30	fix typos in tag_map flagged by `python -m debug-data` (#3542) ## Checklist - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information. Co-authored-by: Ines Montani <ines@ines.io>	2019-04-05 12:06:09 +02:00
Ines Montani	cd21778bef	Merge pull request #3539 from jeannefukumaru/master Added tags previously missing from Indonesian `tag_map.py`	2019-04-04 11:57:03 +02:00
Jeanne Choo	b6c9807431	Merge remote-tracking branch 'upstream/master'	2019-04-04 14:21:50 +08:00

10050 Commits