Wolfgang Seeker
eae35e9b27
add tokenizer files for German, add/change code to train German pos tagger
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/
- init_model.py
  - change doc freq threshold to 0
- add train_german_tagger.py
  - expects conll09-formatted input
2016-02-18 13:24:20 +01:00
Matthew Honnibal
7cbff48ace
* Set the German lemma rules to be an empty JSON object
2016-02-02 22:30:51 +01:00
Matthew Honnibal
d0f06c5cc4
* Add missing tags to the German tag map
2016-02-02 22:30:22 +01:00
Matthew Honnibal
6c633f2edc
Fix Issue #243: Incorrect gazetteer entry
2016-01-30 06:58:29 +11:00
Matthew Honnibal
4b4eec8b47
* Fix Issue #201: Tokenization of there'll
2015-12-29 18:09:09 +01:00
Matthew Honnibal
e8bd92f1e7
* Fix lemma of let's, re Issue #177
2015-11-13 06:42:23 +11:00
Matthew Honnibal
726bb648da
* Fix non-breaking space in specials.json
2015-10-19 12:46:11 +11:00
Matthew Honnibal
e39095da82
* Fix designation of non-breaking space in specials.json.
2015-10-19 12:39:03 +11:00
Matthew Honnibal
454c1996d0
* Add tokenizer rule to fix numeric range tokenization
2015-10-17 15:49:51 +11:00
Matthew Honnibal
7488821677
* Map NIL to empty string in tag map
2015-10-10 22:09:50 +11:00
Matthew Honnibal
4bbd1388bd
* Whitespace
2015-10-10 16:03:48 +11:00
Matthew Honnibal
bdcb8d695c
* Add non-breaking space to specials.json
2015-10-10 15:54:06 +11:00
Matthew Honnibal
c12d36d5f4
* Fix quote marks in lemma_rules
2015-10-10 15:03:36 +11:00
Matthew Honnibal
57b3cd4661
* Add smart-quotes to lemma rules
2015-10-10 14:06:46 +11:00
Matthew Honnibal
7e7f28e1fd
* Add smart-quote possessive marker in generate_specials
2015-10-10 14:06:09 +11:00
Matthew Honnibal
a510858f5a
* Pretty-print specials.json, and add the em dash
2015-10-09 11:07:45 +02:00
Matthew Honnibal
49600a44a8
* Fix trailing comma in lemma_rules.json
2015-10-09 11:06:57 +02:00
Matthew Honnibal
0e92e8574a
* Fix pos tag in em-dash in specials
2015-10-09 11:06:37 +02:00
Matthew Honnibal
d341443282
* Remove em-dash from lemma rules. Handle instead in specials.
2015-10-09 10:27:13 +02:00
Matthew Honnibal
b6047afe4c
* Fix punctuation lemma rules, to resolve Issue #130
2015-10-09 10:25:37 +02:00
Matthew Honnibal
393a13d1af
* Add unicode em dash to specials.json, so that we can control what POS tag it gets. This way we can prevent sentence boundary detection errors, to address Issue #130.
2015-10-09 19:24:33 +11:00
Matthew Honnibal
1490feda29
* Make generate_specials pretty-print the specials.json file
2015-10-09 19:23:47 +11:00
Matthew Honnibal
1842a53e73
* Lemmatize smart quotes as plain quotes
2015-10-09 19:09:36 +11:00
Matthew Honnibal
5332c0b697
* Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130
2015-10-09 18:54:40 +11:00
Matthew Honnibal
e3e8994368
* Patch italian tag map
2015-10-08 14:00:13 +11:00
Matthew Honnibal
2d68f75b6a
* Fix identity tag map
2015-10-08 13:59:56 +11:00
Matthew Honnibal
095831e5bf
* Start adding auxiliaries to morphs.json
2015-09-27 16:56:34 +10:00
Matthew Honnibal
c579b6b96c
* Update English morphs.json
2015-09-24 22:38:41 +10:00
Matthew Honnibal
3b3547251c
* Fix Issue #102: DT tag was mapped to DET.
2015-09-24 18:38:47 +10:00
Matthew Honnibal
be4848fbcb
* Update morphs.json with universal dependencies/interset morphological features
2015-09-24 00:59:42 +10:00
Henning Peters
911de2ae49
add overseen (?) char
2015-09-22 12:29:47 +02:00
Henning Peters
9ecb98f30e
basic german rules
2015-09-22 11:56:29 +02:00
Matthew Honnibal
b9e31dc245
* Bug fix to gazetteer.json
2015-09-10 14:50:44 +02:00
Matthew Honnibal
623329b19a
Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop
2015-09-08 14:27:01 +02:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
dbf8dce109
Merge branch 'gaz' of ssh://github.com/honnibal/spaCy into gaz
2015-09-06 18:44:14 +02:00
Matthew Honnibal
577418986a
* Add draft Italian stuff
2015-09-06 18:44:10 +02:00
Matthew Honnibal
80a66c0159
* Add draft finnish stuff
2015-09-06 18:43:44 +02:00
Matthew Honnibal
b3703836f9
* Add en lemma rules
2015-09-06 17:56:11 +02:00
Matthew Honnibal
238b2f533b
* Add lemma rules
2015-09-06 17:55:53 +02:00
Matthew Honnibal
c9f2082e3c
* Fix compilation error in en/tag_map.json
2015-09-06 17:54:51 +02:00
Matthew Honnibal
0af139e183
* Tagger training now working. Still need to test load/save of model. Morphology still broken.
2015-08-27 09:16:11 +02:00
Matthew Honnibal
56c4e07a59
Update gazetteer.json
2015-08-27 08:53:48 +10:00
Matthew Honnibal
494da25872
* Refactor for more universal spacy
2015-08-26 19:13:50 +02:00
jxs8172
85f01c5e16
Add contributor agreement. Add exception to 'it' so that 'its' and 'Its' isn't generated (its =/= it's)
2015-08-24 18:20:06 -04:00
jxs8172
5876248109
Add missing we've and hardcoded 's and 'S
2015-08-21 22:57:47 -04:00
jxs8172
a5e0a0073b
Add a script to generate the specials.json file, to take care of handling uppercase and missing apostrophe contractions
2015-08-21 22:39:33 -04:00
Matthew Honnibal
b27bd18d6e
* Add spaCy to gazetteer
2015-08-08 23:30:49 +02:00
Matthew Honnibal
855af087fc
* Fix gazetteer.json
2015-08-06 17:27:51 +02:00
Matthew Honnibal
0e098815cc
* Expand gazetteer with some of the errors from the reddit parse
2015-08-06 17:13:27 +02:00