Commit Graph

  • f1c3e17c80 * Work on intro copy Matthew Honnibal 2014-11-03 00:13:19 +1100
  • fa91506073 * Add '' double quote to suffixes file Matthew Honnibal 2014-11-03 00:12:59 +1100
  • 493d5ffb50 * Add test for '' in punct Matthew Honnibal 2014-11-02 21:24:09 +1100
  • 711ed0f636 * Whitespace Matthew Honnibal 2014-11-02 14:22:32 +1100
  • fcd9490d56 * Add pos_tag method to Language Matthew Honnibal 2014-11-02 14:21:43 +1100
  • 99b5cefa88 * Add tests for emoticon tokenization Matthew Honnibal 2014-11-02 13:22:14 +1100
  • 23131f21bb * Add tests for like_url Matthew Honnibal 2014-11-02 13:21:57 +1100
  • dc6c3c0f56 * Add tests for like_number Matthew Honnibal 2014-11-02 13:21:39 +1100
  • 829bb2bdbe * Add mappings to Twitter POS tag corpus Matthew Honnibal 2014-11-02 13:21:19 +1100
  • 437cd2217d * Fix strings i/o, removing use of ujson library in favour of plain text file. Allows better control of codecs. Matthew Honnibal 2014-11-02 13:20:37 +1100
  • 3352e89e21 * Use LIKE_URL and LIKE_NUMBER flag features. Seems to improve accuracy on onto web Matthew Honnibal 2014-11-02 13:19:54 +1100
  • 8335706321 * Add LIKE_URL and LIKE_NUMBER flag features Matthew Honnibal 2014-11-02 13:19:05 +1100
  • c414d0eebe * Add tests for is_number Matthew Honnibal 2014-11-01 19:13:40 +1100
  • 5484fbea69 * Implement is_number Matthew Honnibal 2014-11-01 19:13:24 +1100
  • f685218e21 * Add is_urlish function Matthew Honnibal 2014-11-01 17:39:34 +1100
  • 11e42fd070 * Add emoticons to tokenization Matthew Honnibal 2014-11-01 15:14:46 +1100
  • 39743323ea * Add i'ma to tokenization rules Matthew Honnibal 2014-10-31 17:45:44 +1100
  • 09a3e54176 * Delete print statements from stringstore Matthew Honnibal 2014-10-31 17:45:26 +1100
  • b186a66bae * Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype Matthew Honnibal 2014-10-31 17:44:39 +1100
  • a8ca078b24 * Restore lexemes field to lexicon Matthew Honnibal 2014-10-31 17:43:25 +1100
  • 6c807aa45f * Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries Matthew Honnibal 2014-10-31 17:43:00 +1100
  • aaf6953fe0 * Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web. Matthew Honnibal 2014-10-31 17:42:15 +1100
  • f67cb9a5a3 * Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web. Matthew Honnibal 2014-10-31 17:42:04 +1100
  • 63114820cf * Upd tests for tighter interface Matthew Honnibal 2014-10-30 18:15:30 +1100
  • ea8f1e7053 * Tighten interfaces Matthew Honnibal 2014-10-30 18:14:42 +1100
  • ea85bf3a0a * Tighten the interface to Language Matthew Honnibal 2014-10-30 18:01:27 +1100
  • c6fcd03692 * Small efficiency tweak to lexeme init Matthew Honnibal 2014-10-30 17:56:11 +1100
  • 87c2418a89 * Fiddle with data types on Lexeme, to compress them to a much smaller size. Matthew Honnibal 2014-10-30 15:42:15 +1100
  • ac88893232 * Fix Token after lexeme changes Matthew Honnibal 2014-10-30 15:30:52 +1100
  • e6b87766fe * Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme Matthew Honnibal 2014-10-30 15:21:38 +1100
  • 889b7b48b4 * Fix POS tagger, so that it loads correctly. Lexemes are being read in. Matthew Honnibal 2014-10-30 13:38:55 +1100
  • 67c8c8019f * Update lexeme serialization, using a binary file format Matthew Honnibal 2014-10-30 01:01:00 +1100
  • 13909a2e24 * Rewriting Lexeme serialization. Matthew Honnibal 2014-10-29 23:19:38 +1100
  • 234d49bf4d * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. Matthew Honnibal 2014-10-24 02:23:42 +1100
  • 08ce602243 * Large refactor, particularly to Python API Matthew Honnibal 2014-10-24 00:59:17 +1100
  • 168b2b8cb2 * Add tests for string intern Matthew Honnibal 2014-10-23 20:47:06 +1100
  • 7baef5b7ff * Fix padding on tokens Matthew Honnibal 2014-10-23 04:01:17 +1100
  • 96b835a3d4 * Upd for refactored Tokens class. Now gets 95.74, 185ms training on swbd_wsj_ewtb, eval on onto_web, Google POS tags. Matthew Honnibal 2014-10-23 03:20:02 +1100
  • e5e951ae67 * Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding. Matthew Honnibal 2014-10-23 01:57:59 +1100
  • ea1d4a81eb * Refactoring get_atoms, improving tokens API Matthew Honnibal 2014-10-22 13:10:56 +1100
  • ad49e2482e * Tagger now gets 97pc on wsj, parsing 19-21 in 500ms. Gets 92.7 on web text. Matthew Honnibal 2014-10-22 12:57:06 +1100
  • 0a0e41f6c8 * Add prefix and suffix features Matthew Honnibal 2014-10-22 12:56:09 +1100
  • 7018b53d3a * Improve array features in tokens Matthew Honnibal 2014-10-22 12:55:42 +1100
  • 43d5964e13 * Add function to read detokenization rules Matthew Honnibal 2014-10-22 12:54:59 +1100
  • 077885637d * Add test for reading in POS tags Matthew Honnibal 2014-10-22 10:18:43 +1100
  • 224bdae996 * Add POS utilities Matthew Honnibal 2014-10-22 10:17:57 +1100
  • 5ebe14f353 * Add greedy pos tagger Matthew Honnibal 2014-10-22 10:17:26 +1100
  • 12742f4f83 * Add detokenize method and test Matthew Honnibal 2014-10-18 18:02:05 +1100
  • df110476d5 * Update docs Matthew Honnibal 2014-10-15 21:50:34 +1100
  • 849de654e7 * Add file for infix patterns Matthew Honnibal 2014-10-14 20:26:43 +1100
  • 31aad7c08a * Test hyphenation etc Matthew Honnibal 2014-10-14 20:26:16 +1100
  • 99f5e59286 * Have tokenizer emit tokens for whitespace other than single spaces Matthew Honnibal 2014-10-14 20:25:57 +1100
  • 43743a5d63 * Work on efficiency Matthew Honnibal 2014-10-14 18:22:41 +1100
  • 6fb42c4919 * Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang Matthew Honnibal 2014-10-14 15:47:06 +1100
  • 2805068ca8 * Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer Matthew Honnibal 2014-10-14 15:21:03 +1100
  • 65d3ead4fd * Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id Matthew Honnibal 2014-10-14 15:19:07 +1100
  • 5abb194553 * Add semi-colon to suffix punct Matthew Honnibal 2014-10-14 10:43:45 +1100
  • 868e558037 * Preparations in place to handle hyphenation etc Matthew Honnibal 2014-10-10 20:23:23 +1100
  • ff79dbac2e * More slight cleaning for lang.pyx Matthew Honnibal 2014-10-10 20:11:22 +1100
  • 3d82ed1e5e * More slight cleaning for lang.pyx Matthew Honnibal 2014-10-10 19:50:07 +1100
  • 02e948e7d5 * Remove counts stuff from Language class Matthew Honnibal 2014-10-10 19:25:01 +1100
  • 71ee921055 * Slight cleaning of tokenizer code Matthew Honnibal 2014-10-10 19:17:22 +1100
  • 59b41a9fd3 * Switch to new data model, tests passing Matthew Honnibal 2014-10-10 08:11:31 +1100
  • 1b0e01d3d8 * Revising data model of lexeme. Compiles. Matthew Honnibal 2014-10-09 19:53:30 +1100
  • e40caae51f * Update Lexicon class to expect a list of lexeme dict descriptions Matthew Honnibal 2014-10-09 14:51:35 +1100
  • 51d75b244b * Add serialize/deserialize functions for lexeme, transport to/from python dict. Matthew Honnibal 2014-10-09 14:10:46 +1100
  • d73d89a2de * Add i attribute to lexeme, giving lexemes sequential IDs. Matthew Honnibal 2014-10-09 13:50:05 +1100
  • 0c6402ab73 * Upd docs Matthew Honnibal 2014-09-26 18:40:18 +0200
  • 096ef2b199 * Rename external hashing lib, from trustyc to preshed Matthew Honnibal 2014-09-26 18:40:03 +0200
  • 11a346fd5e * Remove hashing modules, which are now taken over by external lib Matthew Honnibal 2014-09-26 18:39:40 +0200
  • bfab6403bc * Re-add docs, sorting out mess from gh-pages Matthew Honnibal 2014-09-25 18:42:20 +0200
  • aba4a7c7ea * Remove ptb3 file from setup Matthew Honnibal 2014-09-25 18:41:25 +0200
  • bc460de171 * Add extra tests Matthew Honnibal 2014-09-25 18:29:42 +0200
  • 93505276ed * Add German tokenizer files Matthew Honnibal 2014-09-25 18:29:13 +0200
  • 2e44fa7179 * Add util.py Matthew Honnibal 2014-09-25 18:26:22 +0200
  • c4cd3bc57a * Add prefix and suffix data files Matthew Honnibal 2014-09-25 18:24:52 +0200
  • 2d4e5ceafd * Remove old docs stuff Matthew Honnibal 2014-09-25 18:24:05 +0200
  • b15619e170 * Use PointerHash instead of locally provided _hashing module Matthew Honnibal 2014-09-25 18:22:52 +0200
  • ed446c67ad * Add typedefs file Matthew Honnibal 2014-09-17 23:10:32 +0200
  • 316a57c4be * Remove own memory classes, which have now been broken out into their own package Matthew Honnibal 2014-09-17 23:10:07 +0200
  • ac522e2553 * Switch from own memory class to cymem, in pip Matthew Honnibal 2014-09-17 23:09:24 +0200
  • 6266cac593 * Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks Matthew Honnibal 2014-09-17 20:02:26 +0200
  • 5a20dfc03e * Add memory management code Matthew Honnibal 2014-09-17 20:02:06 +0200
  • 0152831c89 * Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. Matthew Honnibal 2014-09-16 18:01:46 +0200
  • 143e51ec73 * Refactor tokenization, splitting it into a clearer life-cycle. Matthew Honnibal 2014-09-16 13:16:02 +0200
  • c396581a0b * Fiddle with the way strings are interned in lexeme Matthew Honnibal 2014-09-15 06:34:45 +0200
  • 0bb547ab98 * Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance Matthew Honnibal 2014-09-15 06:33:53 +0200
  • 7959141d36 * Add a few abbreviations, to get tests to pass Matthew Honnibal 2014-09-15 06:32:18 +0200
  • db191361ee * Add new tests for fancier tokenization cases Matthew Honnibal 2014-09-15 06:31:58 +0200
  • 6fc06bfe2f * Hack a hard-cased unit in to get a test to pass Matthew Honnibal 2014-09-15 06:31:35 +0200
  • d235299260 * Few nips and tucks to hash table Matthew Honnibal 2014-09-15 05:03:44 +0200
  • e68a431e5e * Pass only the tokens vector to _tokenize, instead of the whole python object. Matthew Honnibal 2014-09-15 04:01:38 +0200
  • 08cef75ffd * Switch to using a heap-allocated vector in tokens Matthew Honnibal 2014-09-15 03:46:14 +0200
  • f77b7098c0 * Upd Tokens to use vector, with bounds checking. Matthew Honnibal 2014-09-15 03:22:40 +0200
  • 0f6bf2a2ee * Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector. Matthew Honnibal 2014-09-15 02:08:39 +0200
  • 5dcc1a426a * Update tokenization tests for new tokenizer rules Matthew Honnibal 2014-09-15 01:32:51 +0200
  • df24e3708c * Move EnglishTokens stuff to Tokens Matthew Honnibal 2014-09-15 01:31:44 +0200
  • bd08cb09a2 * Remove short-circuiting of initial_size argument for PointerHash Matthew Honnibal 2014-09-15 01:30:49 +0200
  • f3393cf57c * Improve interface for PointerHash Matthew Honnibal 2014-09-13 17:29:58 +0200
  • 45865be37e * Switch hash interface, using void* instead of size_t, to avoid casts. Matthew Honnibal 2014-09-13 17:02:06 +0200