Ines Montani
377ab1cffb
Improve Token.prob and Lexeme.prob docs ( resolves #3701 )
2019-05-11 15:22:34 +02:00
Aaron Kub
914f4b2938
fixing regex matcher examples ( #3708 ) ( #3719 )
2019-05-10 14:24:24 +02:00
Aaron Kub
719a15f23d
fixing regex matcher examples ( #3708 ) ( #3719 )
2019-05-10 14:23:52 +02:00
Luca Dorigo
82d034f976
Update glossary.py to match information found in documentation ( #3704 ) (closes #3679)
...
* Update glossary.py to match information found in documentation
I used regexes to add any dependency tag that was in the documentation but not in the glossary. Solves #3679 👍
* Adds forgotten colon
2019-05-10 14:23:20 +02:00
Wannaphong Phatthiyaphaibun
5a14a13f64
fix thai bug ( #3693 )
...
fix tokenize for pythainlp
2019-05-10 14:21:34 +02:00
Luca Dorigo
2663f4133c
Submit contributor agreement ( #3705 )
2019-05-10 14:19:18 +02:00
Ines Montani
65b55f1aaa
Add version tag to --base-model argument ( closes #3720 )
2019-05-10 14:06:47 +02:00
Ines Montani
f256bfbcc4
Add version tag to --base-model argument ( closes #3720 )
2019-05-10 14:06:06 +02:00
svlandeg
b6d788064a
some first experiments with different architectures and metrics
2019-05-10 12:53:14 +02:00
svlandeg
9d089c0410
grouping clusters of instances per doc+mention
2019-05-09 18:11:49 +02:00
svlandeg
c6ca8649d7
first stab at model - not functional yet
2019-05-09 17:23:19 +02:00
Ines Montani
61829f1e79
Fix typo
2019-05-09 15:36:29 +02:00
richardpaulhudson
a1e07f0d14
Request to include Holmes in spaCy Universe ( #3685 )
...
* Request to add Holmes to spaCy Universe
Dear spaCy team, I would be grateful if you would consider my Python library Holmes for inclusion in the spaCy Universe. Holmes transforms the syntactic structures delivered by spaCy into semantic structures that, together with various other techniques including ontological matching and word embeddings, serve as the basis for information extraction. Holmes supports several use cases including chatbot, structured search, topic matching and supervised document classification. I had the basic idea for Holmes around 15 years ago and now spaCy has made it possible to build an implementation that is stable and fast enough to actually be of use - thank you! At present Holmes supports English and German (I am based in Munich) but could easily be extended to support any other language with a spaCy model.
* Added
2019-05-08 02:42:03 +02:00
Ines Montani
505c9e0e19
Add util.filter_spans helper ( #3686 )
2019-05-08 02:33:40 +02:00
svlandeg
9f33732b96
using entity descriptions and article texts as input embedding vectors for training
2019-05-07 16:03:42 +02:00
F0rge1cE
dd1e6b0bc6
Fix offset bug in loading pre-trained word2vec. ( #3689 )
...
* Fix offset bug in loading pre-trained word2vec.
* add contributor agreement
2019-05-06 23:00:38 +02:00
Bram Vanroy
4762f56062
Re-added Universe readme ( #3688 ) ( closes #3680 )
2019-05-06 21:10:58 +02:00
Bram Vanroy
8e6f8deaf6
Re-added Universe readme ( #3688 ) ( closes #3680 )
2019-05-06 21:08:01 +02:00
Ines Montani
78cb807a9a
Auto-format [ci skip]
2019-05-06 16:58:29 +02:00
svlandeg
7e348d7f7f
baseline evaluation using highest-freq candidate
2019-05-06 15:13:50 +02:00
Ines Montani
dd153b2b33
Simplify helper (see #3681 ) [ci skip]
2019-05-06 15:13:10 +02:00
Ines Montani
f8fce6c03c
Fix typo (see #3681 )
2019-05-06 15:02:11 +02:00
Ines Montani
f2a56c1b56
Rewrite example to use Retokenizer ( resolves #3681 )
...
Also add helper to filter spans
2019-05-06 14:51:18 +02:00
svlandeg
6961215578
refactor code to separate functionality into different files
2019-05-06 10:56:56 +02:00
Brad Jascob
955b95cb8b
Fix inconsistent lemmatizer issue #3484 ( #3646 )
...
* Fix inconsistent lemmatizer issue #3484
* Remove test case
2019-05-04 18:16:03 +02:00
svlandeg
f5190267e7
run only 100M of WP data as training dataset (9%)
2019-05-03 18:09:09 +02:00
svlandeg
4e929600e5
fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now)
2019-05-03 17:37:47 +02:00
svlandeg
34600c92bd
try catch per article to ensure the pipeline goes on
2019-05-03 15:10:09 +02:00
Ines Montani
b4d142e3c4
Adjust wording and formatting [ci skip]
2019-05-03 12:00:31 +02:00
Ines Montani
04658ebbb2
Relax jsonschema pin ( closes #3628 )
2019-05-03 11:58:58 +02:00
d5555
ba4bcbf285
Update universe.json ( #3653 ) [ci skip]
...
* Update universe.json
* Update universe.json
2019-05-03 11:50:12 +02:00
svlandeg
bbcb9da466
creating training data with clean WP texts and QID entities true/false
2019-05-03 10:44:29 +02:00
svlandeg
cba9680d13
run NER on clean WP text and link to gold-standard entity IDs
2019-05-02 17:24:52 +02:00
svlandeg
581dc9742d
parsing clean text from WP articles to use as input data for NER and NEL
2019-05-02 17:09:56 +02:00
svlandeg
8353552191
cleanup
2019-05-01 23:26:16 +02:00
svlandeg
1ae41daaa9
allow small rounding errors
2019-05-01 23:05:40 +02:00
Dobita21
f95ecedd83
Add Thai lex_attrs ( #3655 )
...
* test spaCy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
* add acronym and some norm exception words
* add lex_attrs
* Add lexical attribute getters into the language defaults
* fix LEX_ATTRS
Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
张晓飞
ba1ff00370
update response after calling add_pipe ( #3661 )
...
* update response after calling add_pipe
The print_info component is appended last, so it needs to be shown at the end of the pipeline
* Create henry860916.md
2019-05-01 12:02:18 +02:00
BreakBB
8952004dfc
Update French example sents and add two German stop words ( #3662 )
...
* Update french example sentences
* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00
svlandeg
3629a52ede
reading all persons in wikidata
2019-05-01 01:00:59 +02:00
svlandeg
60b54ae8ce
bulk entity writing and experiment with regex wikidata reader to speed up processing
2019-05-01 00:00:38 +02:00
svlandeg
653b7d9c87
calculate entity raw counts offline to speed up KB construction
2019-04-30 11:39:42 +02:00
Ramiro Gómez
8ee4100f8f
Remove dangling M ( #3657 )
...
I assume this is a typo. Sorry if it has a meaning that I'm not aware of.
2019-04-29 19:44:43 +02:00
Amit Chaudhary
167d63af31
Fix broken link to Dive Into Python 3 website ( #3656 )
...
* Fix broken link to Dive Into Python 3 website
* Sign spaCy Contributor Agreement
2019-04-29 19:44:00 +02:00
Ramiro Gómez
e7e5999ddc
Create yaph.md so I can contribute ( #3658 )
2019-04-29 19:43:06 +02:00
svlandeg
19e8f339cb
deduce entity freq from WP corpus and serialize vocab in WP test
2019-04-29 17:37:29 +02:00
svlandeg
387263d618
simplify chains
2019-04-29 13:58:07 +02:00
Brad Jascob
6fcafcc564
Doc changes for local website setup ( #3651 )
2019-04-27 13:28:23 +02:00
Ivan Tham
fa94f83697
Improve redundant variable name ( #3643 )
...
* Improve redundant variable name
* Apply suggestions from code review
Co-Authored-By: pickfire <pickfire@riseup.net>
2019-04-26 16:50:14 +02:00
Ines Montani
bf92625ede
Update from master
2019-04-26 13:19:50 +02:00