Commit Graph

  • af9ed18cf1 * Bug fixes to NER Matthew Honnibal 2014-11-10 17:39:23 +1100
  • d7b2843643 * Add some tests for ner Matthew Honnibal 2014-11-10 16:29:19 +1100
  • 9f2587f5ec * Work on shift-reduce NER Matthew Honnibal 2014-11-10 16:28:56 +1100
  • f307eb2e36 * Refactor context extraction, and start breaking out gold standards into their own functions Matthew Honnibal 2014-11-09 15:43:07 +1100
  • 602f993af9 * Moving tagger to accept multiple correct answers Matthew Honnibal 2014-11-09 15:18:33 +1100
  • 10a33ec725 * Upd fabfile for experiments Matthew Honnibal 2014-11-07 04:44:14 +1100
  • f37d896a42 * Upd NER feats. With adadelta learner, getting 76.9 on NER Matthew Honnibal 2014-11-07 04:43:54 +1100
  • a42321bd4e * Upd shape test Matthew Honnibal 2014-11-07 04:42:54 +1100
  • 68d1cdad62 * When encoding POS/NER tags, accept '-' as a missing value Matthew Honnibal 2014-11-07 04:42:31 +1100
  • 949a6245f9 * Increase default number of iterations from 5 to 10 Matthew Honnibal 2014-11-07 04:42:04 +1100
  • 3cab1d9a29 * Refine word_shape feature, by trimming the max sequence length Matthew Honnibal 2014-11-07 04:41:29 +1100
  • b4454cf036 * Add extra context tokens Matthew Honnibal 2014-11-07 04:40:36 +1100
  • 50309e6e49 * Fix context vector, importing all features Matthew Honnibal 2014-11-05 22:11:39 +1100
  • 07a23768de * Play with NER feats a bit. Up to 82.00 training on MUC7. Matthew Honnibal 2014-11-05 21:47:17 +1100
  • edf739134c * Make make quiet by default, and add a vmake option for verbose make Matthew Honnibal 2014-11-05 20:46:29 +1100
  • dbbb914480 * Upd setup Matthew Honnibal 2014-11-05 20:45:44 +1100
  • 4ecbe8c893 * Complete refactor of Tagger features, to use a generic list of context names. Matthew Honnibal 2014-11-05 20:45:29 +1100
  • 0a8c84625d * Moving feature context stuff to a generalized place Matthew Honnibal 2014-11-05 19:55:10 +1100
  • 3733444101 * Generalize tagger code, in preparation for NER and supersense tagging. Matthew Honnibal 2014-11-05 03:42:14 +1100
  • 81da61f3cf * Remove out-dated POS data test Matthew Honnibal 2014-11-05 02:04:12 +1100
  • 0de700b566 * Comment out tests of hyphenation, while we decide what hyphenation policy should be. Matthew Honnibal 2014-11-05 02:03:22 +1100
  • abbe3e44b0 * Move spacy.pos tagger to spacy.tagger, and generalize it so that it can take on other tagging tasks, given a different set of feature templates. Matthew Honnibal 2014-11-05 00:37:59 +1100
  • 2420d944cb * Upd sales copy Matthew Honnibal 2014-11-04 17:01:54 +1100
  • 954c970415 * Add __iter__ method to tokens Matthew Honnibal 2014-11-04 01:07:08 +1100
  • f07457a91f * Remove POS alignment stuff. Now use training data based on raw text, instead of clumsy detokenization stuff Matthew Honnibal 2014-11-04 01:06:43 +1100
  • bea762ec04 * Update tokenization rules Matthew Honnibal 2014-11-04 01:06:00 +1100
  • b8d5881333 * Update sales copy Matthew Honnibal 2014-11-03 13:54:18 +1100
  • ae52f9f38c * Remove vocab10k from tokens Matthew Honnibal 2014-11-03 00:23:20 +1100
  • 11915e5238 * Update tests Matthew Honnibal 2014-11-03 00:23:04 +1100
  • 75329e9ef8 * Add Co. abbreviation to tokenization rules Matthew Honnibal 2014-11-03 00:16:20 +1100
  • 32fb50dc35 * Remove non_sparse method --- features wanting this can do it easily enough. Matthew Honnibal 2014-11-03 00:15:47 +1100
  • b5ae1471db * Fiddle with POS tag features Matthew Honnibal 2014-11-03 00:15:03 +1100
  • 70ea862703 * Remove vocab10k field, and add flags for gazetteers Matthew Honnibal 2014-11-03 00:13:51 +1100
  • f1c3e17c80 * Work on intro copy Matthew Honnibal 2014-11-03 00:13:19 +1100
  • fa91506073 * Add '' double quote to suffixes file Matthew Honnibal 2014-11-03 00:12:59 +1100
  • 493d5ffb50 * Add test for '' in punct Matthew Honnibal 2014-11-02 21:24:09 +1100
  • 711ed0f636 * Whitespace Matthew Honnibal 2014-11-02 14:22:32 +1100
  • fcd9490d56 * Add pos_tag method to Language Matthew Honnibal 2014-11-02 14:21:43 +1100
  • 99b5cefa88 * Add tests for emoticon tokenization Matthew Honnibal 2014-11-02 13:22:14 +1100
  • 23131f21bb * Add tests for like_url Matthew Honnibal 2014-11-02 13:21:57 +1100
  • dc6c3c0f56 * Add tests for like_number Matthew Honnibal 2014-11-02 13:21:39 +1100
  • 829bb2bdbe * Add mappings to Twitter POS tag corpus Matthew Honnibal 2014-11-02 13:21:19 +1100
  • 437cd2217d * Fix strings i/o, removing use of ujson library in favour of plain text file. Allows better control of codecs. Matthew Honnibal 2014-11-02 13:20:37 +1100
  • 3352e89e21 * Use LIKE_URL and LIKE_NUMBER flag features. Seems to improve accuracy on onto web Matthew Honnibal 2014-11-02 13:19:54 +1100
  • 8335706321 * Add LIKE_URL and LIKE_NUMBER flag features Matthew Honnibal 2014-11-02 13:19:05 +1100
  • c414d0eebe * Add tests for is_number Matthew Honnibal 2014-11-01 19:13:40 +1100
  • 5484fbea69 * Implement is_number Matthew Honnibal 2014-11-01 19:13:24 +1100
  • f685218e21 * Add is_urlish function Matthew Honnibal 2014-11-01 17:39:34 +1100
  • 11e42fd070 * Add emoticons to tokenization Matthew Honnibal 2014-11-01 15:14:46 +1100
  • 39743323ea * Add i'ma to tokenization rules Matthew Honnibal 2014-10-31 17:45:44 +1100
  • 09a3e54176 * Delete print statements from stringstore Matthew Honnibal 2014-10-31 17:45:26 +1100
  • b186a66bae * Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype Matthew Honnibal 2014-10-31 17:44:39 +1100
  • a8ca078b24 * Restore lexemes field to lexicon Matthew Honnibal 2014-10-31 17:43:25 +1100
  • 6c807aa45f * Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries Matthew Honnibal 2014-10-31 17:43:00 +1100
  • aaf6953fe0 * Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web. Matthew Honnibal 2014-10-31 17:42:15 +1100
  • f67cb9a5a3 * Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web. Matthew Honnibal 2014-10-31 17:42:04 +1100
  • 63114820cf * Upd tests for tighter interface Matthew Honnibal 2014-10-30 18:15:30 +1100
  • ea8f1e7053 * Tighten interfaces Matthew Honnibal 2014-10-30 18:14:42 +1100
  • ea85bf3a0a * Tighten the interface to Language Matthew Honnibal 2014-10-30 18:01:27 +1100
  • c6fcd03692 * Small efficiency tweak to lexeme init Matthew Honnibal 2014-10-30 17:56:11 +1100
  • 87c2418a89 * Fiddle with data types on Lexeme, to compress them to a much smaller size. Matthew Honnibal 2014-10-30 15:42:15 +1100
  • ac88893232 * Fix Token after lexeme changes Matthew Honnibal 2014-10-30 15:30:52 +1100
  • e6b87766fe * Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme Matthew Honnibal 2014-10-30 15:21:38 +1100
  • 889b7b48b4 * Fix POS tagger, so that it loads correctly. Lexemes are being read in. Matthew Honnibal 2014-10-30 13:38:55 +1100
  • 67c8c8019f * Update lexeme serialization, using a binary file format Matthew Honnibal 2014-10-30 01:01:00 +1100
  • 13909a2e24 * Rewriting Lexeme serialization. Matthew Honnibal 2014-10-29 23:19:38 +1100
  • 234d49bf4d * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. Matthew Honnibal 2014-10-24 02:23:42 +1100
  • 08ce602243 * Large refactor, particularly to Python API Matthew Honnibal 2014-10-24 00:59:17 +1100
  • 168b2b8cb2 * Add tests for string intern Matthew Honnibal 2014-10-23 20:47:06 +1100
  • 7baef5b7ff * Fix padding on tokens Matthew Honnibal 2014-10-23 04:01:17 +1100
  • 96b835a3d4 * Upd for refactored Tokens class. Now gets 95.74, 185ms training on swbd_wsj_ewtb, eval on onto_web, Google POS tags. Matthew Honnibal 2014-10-23 03:20:02 +1100
  • e5e951ae67 * Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding. Matthew Honnibal 2014-10-23 01:57:59 +1100
  • ea1d4a81eb * Refactoring get_atoms, improving tokens API Matthew Honnibal 2014-10-22 13:10:56 +1100
  • ad49e2482e * Tagger now gets 97pc on wsj, parsing 19-21 in 500ms. Gets 92.7 on web text. Matthew Honnibal 2014-10-22 12:57:06 +1100
  • 0a0e41f6c8 * Add prefix and suffix features Matthew Honnibal 2014-10-22 12:56:09 +1100
  • 7018b53d3a * Improve array features in tokens Matthew Honnibal 2014-10-22 12:55:42 +1100
  • 43d5964e13 * Add function to read detokenization rules Matthew Honnibal 2014-10-22 12:54:59 +1100
  • 077885637d * Add test for reading in POS tags Matthew Honnibal 2014-10-22 10:18:43 +1100
  • 224bdae996 * Add POS utilities Matthew Honnibal 2014-10-22 10:17:57 +1100
  • 5ebe14f353 * Add greedy pos tagger Matthew Honnibal 2014-10-22 10:17:26 +1100
  • 12742f4f83 * Add detokenize method and test Matthew Honnibal 2014-10-18 18:02:05 +1100
  • df110476d5 * Update docs Matthew Honnibal 2014-10-15 21:50:34 +1100
  • 849de654e7 * Add file for infix patterns Matthew Honnibal 2014-10-14 20:26:43 +1100
  • 31aad7c08a * Test hyphenation etc Matthew Honnibal 2014-10-14 20:26:16 +1100
  • 99f5e59286 * Have tokenizer emit tokens for whitespace other than single spaces Matthew Honnibal 2014-10-14 20:25:57 +1100
  • 43743a5d63 * Work on efficiency Matthew Honnibal 2014-10-14 18:22:41 +1100
  • 6fb42c4919 * Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang Matthew Honnibal 2014-10-14 15:47:06 +1100
  • 2805068ca8 * Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer Matthew Honnibal 2014-10-14 15:21:03 +1100
  • 65d3ead4fd * Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id Matthew Honnibal 2014-10-14 15:19:07 +1100
  • 5abb194553 * Add semi-colon to suffix punct Matthew Honnibal 2014-10-14 10:43:45 +1100
  • 868e558037 * Preparations in place to handle hyphenation etc Matthew Honnibal 2014-10-10 20:23:23 +1100
  • ff79dbac2e * More slight cleaning for lang.pyx Matthew Honnibal 2014-10-10 20:11:22 +1100
  • 3d82ed1e5e * More slight cleaning for lang.pyx Matthew Honnibal 2014-10-10 19:50:07 +1100
  • 02e948e7d5 * Remove counts stuff from Language class Matthew Honnibal 2014-10-10 19:25:01 +1100
  • 71ee921055 * Slight cleaning of tokenizer code Matthew Honnibal 2014-10-10 19:17:22 +1100
  • 59b41a9fd3 * Switch to new data model, tests passing Matthew Honnibal 2014-10-10 08:11:31 +1100
  • 1b0e01d3d8 * Revising data model of lexeme. Compiles. Matthew Honnibal 2014-10-09 19:53:30 +1100
  • e40caae51f * Update Lexicon class to expect a list of lexeme dict descriptions Matthew Honnibal 2014-10-09 14:51:35 +1100
  • 51d75b244b * Add serialize/deserialize functions for lexeme, transport to/from python dict. Matthew Honnibal 2014-10-09 14:10:46 +1100
  • d73d89a2de * Add i attribute to lexeme, giving lexemes sequential IDs. Matthew Honnibal 2014-10-09 13:50:05 +1100