svlandeg
9d089c0410
grouping clusters of instances per doc+mention
2019-05-09 18:11:49 +02:00
svlandeg
c6ca8649d7
first stab at model - not functional yet
2019-05-09 17:23:19 +02:00
svlandeg
9f33732b96
using entity descriptions and article texts as input embedding vectors for training
2019-05-07 16:03:42 +02:00
svlandeg
7e348d7f7f
baseline evaluation using highest-freq candidate
2019-05-06 15:13:50 +02:00
svlandeg
6961215578
refactor code to separate functionality into different files
2019-05-06 10:56:56 +02:00
svlandeg
f5190267e7
run only 100M of WP data as training dataset (9%)
2019-05-03 18:09:09 +02:00
svlandeg
4e929600e5
fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now)
2019-05-03 17:37:47 +02:00
svlandeg
34600c92bd
try catch per article to ensure the pipeline goes on
2019-05-03 15:10:09 +02:00
svlandeg
bbcb9da466
creating training data with clean WP texts and QID entities true/false
2019-05-03 10:44:29 +02:00
svlandeg
cba9680d13
run NER on clean WP text and link to gold-standard entity IDs
2019-05-02 17:24:52 +02:00
svlandeg
581dc9742d
parsing clean text from WP articles to use as input data for NER and NEL
2019-05-02 17:09:56 +02:00
svlandeg
8353552191
cleanup
2019-05-01 23:26:16 +02:00
svlandeg
1ae41daaa9
allow small rounding errors
2019-05-01 23:05:40 +02:00
svlandeg
3629a52ede
reading all persons in wikidata
2019-05-01 01:00:59 +02:00
svlandeg
60b54ae8ce
bulk entity writing and experiment with regex wikidata reader to speed up processing
2019-05-01 00:00:38 +02:00
svlandeg
653b7d9c87
calculate entity raw counts offline to speed up KB construction
2019-04-30 11:39:42 +02:00
svlandeg
19e8f339cb
deduce entity freq from WP corpus and serialize vocab in WP test
2019-04-29 17:37:29 +02:00
svlandeg
387263d618
simplify chains
2019-04-29 13:58:07 +02:00
svlandeg
54d0cea062
unit test for KB serialization
2019-04-24 23:52:34 +02:00
svlandeg
3e0cb69065
KB aliases to and from file
2019-04-24 20:24:24 +02:00
svlandeg
ad6c5e581c
writing and reading number of entries to/from header
2019-04-24 15:31:44 +02:00
svlandeg
6e3223f234
bulk loading in proper order of entity indices
2019-04-24 11:26:38 +02:00
svlandeg
694fea597a
dumping all entryC entries + (inefficient) reading back in
2019-04-23 18:36:50 +02:00
svlandeg
8e70a564f1
custom reader and writer for _EntryC fields (first stab at it - not complete)
2019-04-23 16:33:40 +02:00
svlandeg
004e5e7d1c
little fixes
2019-04-19 14:24:02 +02:00
svlandeg
9a8197185b
fix alias capitalization
2019-04-18 22:37:50 +02:00
svlandeg
9f308eb5dc
fixes for prior prob and linking wikidata IDs with wikipedia titles
2019-04-18 16:14:25 +02:00
svlandeg
10ee8dfea2
poc with few entities and collecting aliases from the WP links
2019-04-18 14:12:17 +02:00
svlandeg
6763e025e1
parse wp dump for links to determine prior probabilities
2019-04-15 11:41:57 +02:00
svlandeg
3163331b1e
wikipedia dump parser and mediawiki format regex cleanup
2019-04-14 21:52:01 +02:00
svlandeg
b31a390a9a
reading types, claims and sitelinks
2019-04-11 21:42:44 +02:00
svlandeg
6e997be4b4
reading wikidata descriptions and aliases
2019-04-11 21:08:22 +02:00
svlandeg
9a7d534b1b
enable nogil for cython functions in kb.pxd
2019-04-10 17:25:10 +02:00
svlandeg
61a33f55d2
little fixes
2019-04-10 16:06:09 +02:00
Ines Montani
6ae3b5699e
Make sure path is string (resolves #3546)
2019-04-08 12:53:41 +02:00
Ines Montani
d0f5e015cb
Auto-format
2019-04-08 12:53:16 +02:00
pierremonico
0d26bfe677
Removes duplicate in table (#3550)
* Removes duplicate in table
Just fixing typos.
* Remove newline
Co-authored-by: Ines Montani <ines@ines.io>
2019-04-08 10:30:42 +02:00
Piero Molino
5198aa4ae6
Added Ludwig among the projects (#3548) [ci skip]
* Added Ludwig among the projects
* Create w4nderlust.md
* Add Uber to logo wall
2019-04-07 13:01:26 +02:00
Dobita21
8bf6967eb7
Update Thai stop words (#3545)
* test sPacy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
2019-04-05 12:06:38 +02:00
jeannefukumaru
f67d881b30
fix typos in tag_map flagged by python -m debug-data (#3542)
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
Co-authored-by: Ines Montani <ines@ines.io>
2019-04-05 12:06:09 +02:00
Ines Montani
cd21778bef
Merge pull request #3539 from jeannefukumaru/master
Added tags previously missing from Indonesian `tag_map.py`
2019-04-04 11:57:03 +02:00
Jeanne Choo
b6c9807431
Merge remote-tracking branch 'upstream/master'
2019-04-04 14:21:50 +08:00
Jeanne Choo
80e15af76c
fixed tag_map.py merge conflict
2019-04-04 14:18:27 +08:00
jeannefukumaru
eba4f77526
Merge pull request #2 from jeannefukumaru/update_indonesian_tag_map
updated tag map with missing tags
2019-04-04 06:49:04 +08:00
jeannefukumaru
876ce01567
updated tag map with missing tags
2019-04-03 23:09:11 +08:00
jeannefukumaru
99e04c4ce2
Merge pull request #1 from jeannefukumaru/added-indonesian-tag-map
Added indonesian tag map
2019-04-03 23:05:05 +08:00
Ines Montani
4faf62d515
Merge pull request #3530 from svlandeg/fix/issue_3521
Allow English stopwords with any type of apostrophe
2019-04-03 14:14:03 +02:00
Yves Peirsman
951825532c
Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization
* Fix conftest
* Update punctuation.py
* Auto-format
* Format and fix tests
* Remove unused test file
* Re-add deleted test
* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains
* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
svlandeg
4ff786e113
addressed all comments by Ines
2019-04-03 13:50:33 +02:00
Ines Montani
6a4575a56c
Don't make "settings" or "title" required in displaCy data (closes #3531)
2019-04-03 10:13:16 +02:00