Commit Graph

45 Commits

Author SHA1 Message Date
Wolfgang Seeker
eae35e9b27 add tokenizer files for German, add/change code to train German pos tagger
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/

- init_model.py
	- change doc freq threshold to 0
- add train_german_tagger.py
	- expects conll09-formatted input
2016-02-18 13:24:20 +01:00
Matthew Honnibal
2348a08481 * Load/dump strings with a json file, instead of the hacky strings file we were using. 2015-10-22 21:13:03 +11:00
Matthew Honnibal
92f750cf8b * Use a gzipped frequencies file in init_model 2015-10-11 06:59:44 +02:00
Matthew Honnibal
064bd69ad0 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-10 16:03:48 +11:00
Matthew Honnibal
83dccf0fd7 * Use io module instead of deprecated codecs module 2015-10-10 14:13:01 +11:00
alvations
8caedba42a caught more codecs.open -> io.open 2015-09-30 20:20:09 +02:00
Matthew Honnibal
1ae55cb63a * Copy tag_map.json in init_model 2015-09-12 05:54:02 +02:00
Matthew Honnibal
5ad4527c42 * Rename Deutsch to German 2015-09-06 20:18:58 +02:00
Matthew Honnibal
950ce36660 * Update init model 2015-09-06 17:51:30 +02:00
Matthew Honnibal
b6b1e1aa12 * Add link for Finnish model 2015-08-27 10:26:02 +02:00
Matthew Honnibal
dc13edd7cb * Refactor init_model to accommodate other languages 2015-08-26 19:14:05 +02:00
Matthew Honnibal
bbf07ac253 * Cut down init_model to work on more languages 2015-08-24 01:05:20 +02:00
Matthew Honnibal
3ecacb9635 * Copy gazetteer file in init_model 2015-08-06 16:07:23 +02:00
Matthew Honnibal
174ed1ad20 * Tighten the frequency filter in init_model 2015-07-27 21:44:51 +02:00
Matthew Honnibal
6047f2aa35 * Fix path to freqs.txt 2015-07-27 02:22:35 +02:00
Matthew Honnibal
0368889d6c * Support gzipped frequencies in init_model 2015-07-26 22:39:22 +02:00
Matthew Honnibal
c4f20847da * Fix init_model for travis tests 2015-07-26 14:03:30 +02:00
Matthew Honnibal
09312b9353 * Fix init_model for travis tests 2015-07-26 13:55:47 +02:00
Matthew Honnibal
90ad717dc4 * Update default freq thresholds in init_model 2015-07-26 01:41:17 +02:00
Matthew Honnibal
6a5e035a48 * Ensure data files are copied for tokenizer in init_model 2015-07-26 01:36:19 +02:00
Matthew Honnibal
ab93898ac6 * Make heuristics more explicit in init_model 2015-07-26 00:22:19 +02:00
Matthew Honnibal
5c04dcd7c1 * Fix init_model 2015-07-25 23:33:02 +02:00
Matthew Honnibal
fd525f0675 * Pass OOV probability around 2015-07-25 23:29:51 +02:00
Matthew Honnibal
5b6bf4d4a6 * Remove probability cap on lexicon 2015-07-25 23:05:51 +02:00
Matthew Honnibal
c62eb110c0 * Fix merge conflict in init_model 2015-07-25 23:04:30 +02:00
Matthew Honnibal
0301472d15 * Fix init_model 2015-07-25 22:56:35 +02:00
Matthew Honnibal
8e800adfbc * Fix init_model 2015-07-25 22:54:08 +02:00
Matthew Honnibal
6076213c16 * Fix init_model script 2015-07-25 22:35:52 +02:00
Matthew Honnibal
ef448649b3 * Add read_freqs function in init_model 2015-07-25 22:16:36 +02:00
Matthew Honnibal
6be3ee311c Py3 compatibility tweak 2015-07-23 13:13:15 +02:00
Matthew Honnibal
d4407d8e2f Py3 compatibility tweak 2015-07-23 09:45:15 +02:00
Matthew Honnibal
da4821fc14 * Add cluster words to probs in init_model 2015-07-23 09:27:07 +02:00
Matthew Honnibal
4af2595d99 * Fix structure of wordnet directory for init_model 2015-07-23 06:35:38 +02:00
Matthew Honnibal
83c0f0da22 * Remove lemmatizer from init_model 2015-07-23 02:32:34 +02:00
Matthew Honnibal
386246db5b * Update init_model, making language resources optional 2015-07-22 00:25:14 +02:00
Matthew Honnibal
af54d05d60 * Remove sense stuff from init_model 2015-07-14 10:56:17 +02:00
Matthew Honnibal
62cfcd76fe * Add supersense sets to lexemes, from WordNet. Look-up via lemmatization. 2015-07-01 18:48:59 +02:00
Matthew Honnibal
c8a553fe91 * Fix cluster initialization 2015-05-31 15:21:28 +02:00
Matthew Honnibal
c037f80638 * Add case expansion to Brown clusters 2015-05-31 05:50:50 +02:00
Matthew Honnibal
5ab0f233a1 * Ensure words in Brown clusters make it into the vocab, even if they're not in our probs list 2015-05-31 05:46:16 +02:00
Matthew Honnibal
4489d87550 * Add cluster=0 by default in init_model 2015-04-29 14:23:13 +02:00
Matthew Honnibal
693c5a1558 * Exclude clusterings for words only seen 1 or 2 times, as their clusters are unreliable 2015-04-17 04:44:52 +02:00
Matthew Honnibal
1629b33082 * Fix copying of tokenizer data in init_model 2015-04-12 04:45:31 +02:00
Matthew Honnibal
baff0f8ad8 * Add docstring explaining script a bit, and add handling of word vectors 2015-04-08 08:20:15 +02:00
Matthew Honnibal
156b70ed82 * Add new script to replace make_lexicon, that does full setup of data 2015-04-08 07:46:53 +02:00