Commit Graph

6004 Commits

Author SHA1 Message Date
svlandeg
bee23cd8af try Tok2Vec instead of SpacyVectors 2019-06-25 16:09:22 +02:00
svlandeg
8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
svlandeg
58a5b40ef6 clean up duplicate code 2019-06-24 15:19:58 +02:00
svlandeg
ddc73b11a9 fix unicode literals 2019-06-24 12:58:18 +02:00
svlandeg
f4af47ce4a Merge branch 'feature/nel-wiki' of https://github.com/svlandeg/spaCy into feature/nel-wiki 2019-06-24 10:57:07 +02:00
svlandeg
b58bace84b small fixes 2019-06-24 10:55:04 +02:00
Ines Montani
872121955c Update error code 2019-06-20 10:35:51 +02:00
Ines Montani
e1be80e3ec Merge branch 'master' into pr/3864 2019-06-20 10:35:37 +02:00
Björn Böing
ebf5a04d6c Update pretrain docs and add unsupported loss_func error (#3860)
* Add error to `get_vectors_loss` for unsupported loss function of `pretrain`

* Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs.

* Add missing quotation marks
2019-06-20 10:30:44 +02:00
svlandeg
b76a43bee4 unicode strings 2019-06-19 13:26:33 +02:00
svlandeg
0b0959b363 UTF8 encoding 2019-06-19 13:11:39 +02:00
svlandeg
cc9ae28a52 custom error and warning messages 2019-06-19 12:35:26 +02:00
svlandeg
791327e3c5 Merge remote-tracking branch 'upstream/master' into feature/nel-wiki 2019-06-19 09:44:05 +02:00
svlandeg
a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg
478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg
0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00
svlandeg
ffae7d3555 sentence encoder only (removing article/mention encoder) 2019-06-18 00:05:47 +02:00
Kabir Khan
1e19f34e29 Add optional id property to EntityRuler patterns (#3591)
* Adding support for entity_id in EntityRuler pipeline component

* Adding Spacy Contributor aggreement

* Updating EntityRuler to use string.format instead of f strings

* Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity.

* Fixing tests

* Remove custom extension entity_id and use built in ent_id token attribute.

* Changing entity_id to ent_id for consistent naming

* entity_ids => ent_ids

* Removing kb, cleaning up tests, making util functions private, use rsplit instead of split
2019-06-16 13:29:04 +02:00
Suraj Rajan
46c78d0a41 Dependency tree pattern matcher (#3465)
* Functional dependency tree pattern matcher

* Tests fail due to inconsistent behaviour

* Renamed dependencymatcher and added optimizations
2019-06-16 13:25:32 +02:00
BreakBB
d8573ee715 Update error raising for CLI pretrain to fix #3840 (#3843)
* Add check for empty input file to CLI pretrain

* Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key

* Skip empty values for correct pretrain keys and log a counter as warning

* Add tests for CLI pretrain core function make_docs.

* Add a short hint for the `tokens` key to the CLI pretrain docs

* Add success message to CLI pretrain

* Update model loading to fix the tests

* Skip empty values and do not create docs out of it
2019-06-16 13:22:57 +02:00
svlandeg
b312f2d0e7 redo training data to be independent of KB and entity-level instead of doc-level 2019-06-14 15:55:26 +02:00
Azagh3l
5accfbb938 Update exemples.py (#3838)
Added missing hyphen and accent.
2019-06-14 09:31:05 +02:00
svlandeg
78dd3e11da write entity linking pipe to file and keep vocab consistent between kb and nlp 2019-06-13 16:25:39 +02:00
svlandeg
b12001f368 small fixes 2019-06-12 22:05:53 +02:00
Ines Montani
f35ce09776 Add regression test for #3839 2019-06-12 13:38:30 +02:00
Ines Montani
aae9034492 Tidy up [ci skip] 2019-06-12 13:38:23 +02:00
svlandeg
6521cfa132 speeding up training 2019-06-12 13:37:05 +02:00
Motoki Wu
9c064e6ad9 Add resume logic to spacy pretrain (#3652)
* Added ability to resume training

* Add to readmee

* Remove duplicate entry
2019-06-12 13:29:23 +02:00
svlandeg
fe1ed432ef eval on dev set, varying combo's of prior and context scores 2019-06-11 11:40:58 +02:00
Azagh3l
eb3e4263ee Update lex_attrs.py (#3835)
Corrected typos, added french (from France) versions of some numbers.
2019-06-11 10:59:16 +02:00
svlandeg
83dc7b46fd first tests with EL pipe 2019-06-10 21:25:26 +02:00
Matthew Honnibal
7f71cf0b02 Merge branch 'master' of https://github.com/explosion/spaCy 2019-06-07 20:41:00 +02:00
Matthew Honnibal
a931d72459 Add merge_subtokens as parser post-process. Re #3830 2019-06-07 20:40:41 +02:00
svlandeg
7de1ee69b8 training loop in proper pipe format 2019-06-07 15:55:10 +02:00
svlandeg
0486ccabfd introduce goldparse.links 2019-06-07 13:54:45 +02:00
svlandeg
a5c061f506 storing NEL training data in GoldParse objects 2019-06-07 12:58:42 +02:00
svlandeg
61f0e2af65 code cleanup 2019-06-06 20:22:14 +02:00
svlandeg
d8b435ceff pretraining description vectors and storing them in the KB 2019-06-06 19:51:27 +02:00
svlandeg
5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg
9abbd0899f separate entity encoder to get 64D descriptions 2019-06-05 00:09:46 +02:00
svlandeg
fb37cdb2d3 implementing el pipe in pipes.pyx (not tested yet) 2019-06-03 21:32:54 +02:00
intrafind
2bba2a3536 Fix for #3811 (#3815)
Corrected type of seed parameter.
2019-06-03 18:32:47 +02:00
svlandeg
d83a1e3052 Merge branch 'master' into feature/nel-wiki 2019-06-03 09:35:10 +02:00
Germán
86eb817b74 Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803))
* (#3803) Spanish like_num returning false for number-like token

* (#3803) Spanish like_num now returning True for number-like token
2019-06-02 12:22:57 +02:00
Ines Montani
09e78b52cf Improve E024 text for incorrect GoldParse (closes #3558) 2019-06-01 14:37:27 +02:00
Ramanan Balakrishnan
26c37c5a4d fix all references to BILUO annotation format (#3797) 2019-05-31 12:19:19 +02:00
Ines Montani
a7fd42d937 Make jsonschema dependency optional (#3784) 2019-05-30 14:34:58 +02:00
Ujwal Narayan
ed7be3f64c Update norm_exceptions.py (#3778)
* Update norm_exceptions.py

Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka, Korean Won, Mexican Dollar, and Egyptian Pound

* Fix formatting [ci skip]
2019-05-27 11:52:52 +02:00
estr4ng7d
604acb6ace Marathi Language Support (#3767)
* Adding Marathi language details and folder to it

* Adding few changes and running tests

* Adding few changes and running tests

* Update __init__.py

mh -> mr

* Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py

* mh -> mr
2019-05-24 14:29:42 +02:00
Ines Montani
7634812172 Document Language.evaluate 2019-05-24 14:06:36 +02:00