svlandeg
9f33732b96
using entity descriptions and article texts as input embedding vectors for training
2019-05-07 16:03:42 +02:00
F0rge1cE
dd1e6b0bc6
Fix offset bug in loading pre-trained word2vec. ( #3689 )
...
* Fix offset bug in loading pre-trained word2vec.
* add contributor agreement
2019-05-06 23:00:38 +02:00
Bram Vanroy
4762f56062
Re-added Universe readme ( #3688 ) ( closes #3680 )
2019-05-06 21:10:58 +02:00
Bram Vanroy
8e6f8deaf6
Re-added Universe readme ( #3688 ) ( closes #3680 )
2019-05-06 21:08:01 +02:00
Ines Montani
78cb807a9a
Auto-format [ci skip]
2019-05-06 16:58:29 +02:00
svlandeg
7e348d7f7f
baseline evaluation using highest-freq candidate
2019-05-06 15:13:50 +02:00
Ines Montani
dd153b2b33
Simplify helper (see #3681 ) [ci skip]
2019-05-06 15:13:10 +02:00
Ines Montani
f8fce6c03c
Fix typo (see #3681 )
2019-05-06 15:02:11 +02:00
Ines Montani
f2a56c1b56
Rewrite example to use Retokenizer ( resolves #3681 )
...
Also add helper to filter spans
2019-05-06 14:51:18 +02:00
svlandeg
6961215578
refactor code to separate functionality into different files
2019-05-06 10:56:56 +02:00
Brad Jascob
955b95cb8b
Fix inconsistent lemmatizer issue #3484 ( #3646 )
...
* Fix inconsistent lemmatizer issue #3484
* Remove test case
2019-05-04 18:16:03 +02:00
svlandeg
f5190267e7
run only 100M of WP data as training dataset (9%)
2019-05-03 18:09:09 +02:00
svlandeg
4e929600e5
fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now)
2019-05-03 17:37:47 +02:00
svlandeg
34600c92bd
try catch per article to ensure the pipeline goes on
2019-05-03 15:10:09 +02:00
Ines Montani
b4d142e3c4
Adjust wording and formatting [ci skip]
2019-05-03 12:00:31 +02:00
Ines Montani
04658ebbb2
Relax jsonschema pin ( closes #3628 )
2019-05-03 11:58:58 +02:00
d5555
ba4bcbf285
Update universe.json ( #3653 ) [ci skip]
...
* Update universe.json
* Update universe.json
2019-05-03 11:50:12 +02:00
svlandeg
bbcb9da466
creating training data with clean WP texts and QID entities true/false
2019-05-03 10:44:29 +02:00
svlandeg
cba9680d13
run NER on clean WP text and link to gold-standard entity IDs
2019-05-02 17:24:52 +02:00
svlandeg
581dc9742d
parsing clean text from WP articles to use as input data for NER and NEL
2019-05-02 17:09:56 +02:00
svlandeg
8353552191
cleanup
2019-05-01 23:26:16 +02:00
svlandeg
1ae41daaa9
allow small rounding errors
2019-05-01 23:05:40 +02:00
Dobita21
f95ecedd83
Add Thai lex_attrs ( #3655 )
...
* test spaCy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
* add acronym and some norm exception words
* add lex_attrs
* Add lexical attribute getters into the language defaults
* fix LEX_ATTRS
Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
张晓飞
ba1ff00370
update response after calling add_pipe ( #3661 )
...
* update response after calling add_pipe
component: print_info is appended last, so it needs to be shown at the end of the pipeline
* Create henry860916.md
2019-05-01 12:02:18 +02:00
BreakBB
8952004dfc
Update French example sents and add two German stop words ( #3662 )
...
* Update french example sentences
* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00
svlandeg
3629a52ede
reading all persons in wikidata
2019-05-01 01:00:59 +02:00
svlandeg
60b54ae8ce
bulk entity writing and experiment with regex wikidata reader to speed up processing
2019-05-01 00:00:38 +02:00
svlandeg
653b7d9c87
calculate entity raw counts offline to speed up KB construction
2019-04-30 11:39:42 +02:00
Ramiro Gómez
8ee4100f8f
Remove dangling M ( #3657 )
...
I assume this is a typo. Sorry if it has a meaning that I'm not aware of.
2019-04-29 19:44:43 +02:00
Amit Chaudhary
167d63af31
Fix broken link to Dive Into Python 3 website ( #3656 )
...
* Fix broken link to Dive Into Python 3 website
* Sign spaCy Contributor Agreement
2019-04-29 19:44:00 +02:00
Ramiro Gómez
e7e5999ddc
Create yaph.md so I can contribute ( #3658 )
2019-04-29 19:43:06 +02:00
svlandeg
19e8f339cb
deduce entity freq from WP corpus and serialize vocab in WP test
2019-04-29 17:37:29 +02:00
svlandeg
387263d618
simplify chains
2019-04-29 13:58:07 +02:00
Brad Jascob
6fcafcc564
Doc changes for local website setup ( #3651 )
2019-04-27 13:28:23 +02:00
Ivan Tham
fa94f83697
Improve redundant variable name ( #3643 )
...
* Improve redundant variable name
* Apply suggestions from code review
Co-Authored-By: pickfire <pickfire@riseup.net>
2019-04-26 16:50:14 +02:00
Ines Montani
bf92625ede
Update from master
2019-04-26 13:19:50 +02:00
Ines Montani
dc87fb805d
Merge branch 'master' of https://github.com/explosion/spaCy
2019-04-26 13:17:57 +02:00
Ines Montani
62060ae9c6
Merge branch 'spacy.io'
2019-04-26 13:17:52 +02:00
Brad Jascob
9afa0d6723
Update Universe Website for pyInflect ( #3641 )
2019-04-26 13:17:36 +02:00
svlandeg
54d0cea062
unit test for KB serialization
2019-04-24 23:52:34 +02:00
svlandeg
3e0cb69065
KB aliases to and from file
2019-04-24 20:24:24 +02:00
svlandeg
ad6c5e581c
writing and reading number of entries to/from header
2019-04-24 15:31:44 +02:00
svlandeg
6e3223f234
bulk loading in proper order of entity indices
2019-04-24 11:26:38 +02:00
Ines Montani
db7c0dbfd6
Update seo.js
2019-04-23 18:39:30 +02:00
svlandeg
694fea597a
dumping all entryC entries + (inefficient) reading back in
2019-04-23 18:36:50 +02:00
svlandeg
8e70a564f1
custom reader and writer for _EntryC fields (first stab at it - not complete)
2019-04-23 16:33:40 +02:00
Dobita21
721e1fc86c
update norm_exceptions ( #3627 )
...
* test spaCy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
* add acronym and some norm exception words
2019-04-23 12:48:03 +02:00
Ines Montani
ec0d840ab5
Document early stopping
2019-04-22 14:31:32 +02:00
Ines Montani
e0f487f904
Rename early_stopping_iter to n_early_stopping
2019-04-22 14:31:25 +02:00
Ines Montani
9767427669
Auto-format
2019-04-22 14:31:11 +02:00