Commit Graph

  • 5abb194553 * Add semi-colon to suffix punct Matthew Honnibal 2014-10-14 10:43:45 +1100
  • 868e558037 * Preparations in place to handle hyphenation etc Matthew Honnibal 2014-10-10 20:23:23 +1100
  • ff79dbac2e * More slight cleaning for lang.pyx Matthew Honnibal 2014-10-10 20:11:22 +1100
  • 3d82ed1e5e * More slight cleaning for lang.pyx Matthew Honnibal 2014-10-10 19:50:07 +1100
  • 02e948e7d5 * Remove counts stuff from Language class Matthew Honnibal 2014-10-10 19:25:01 +1100
  • 71ee921055 * Slight cleaning of tokenizer code Matthew Honnibal 2014-10-10 19:17:22 +1100
  • 59b41a9fd3 * Switch to new data model, tests passing Matthew Honnibal 2014-10-10 08:11:31 +1100
  • 1b0e01d3d8 * Revising data model of lexeme. Compiles. Matthew Honnibal 2014-10-09 19:53:30 +1100
  • e40caae51f * Update Lexicon class to expect a list of lexeme dict descriptions Matthew Honnibal 2014-10-09 14:51:35 +1100
  • 51d75b244b * Add serialize/deserialize functions for lexeme, transport to/from python dict. Matthew Honnibal 2014-10-09 14:10:46 +1100
  • d73d89a2de * Add i attribute to lexeme, giving lexemes sequential IDs. Matthew Honnibal 2014-10-09 13:50:05 +1100
  • 0c6402ab73 * Upd docs Matthew Honnibal 2014-09-26 18:40:18 +0200
  • 096ef2b199 * Rename external hashing lib, from trustyc to preshed Matthew Honnibal 2014-09-26 18:40:03 +0200
  • 11a346fd5e * Remove hashing modules, which are now taken over by external lib Matthew Honnibal 2014-09-26 18:39:40 +0200
  • bfab6403bc * Re-add docs, sorting out mess from gh-pages Matthew Honnibal 2014-09-25 18:42:20 +0200
  • aba4a7c7ea * Remove ptb3 file from setup Matthew Honnibal 2014-09-25 18:41:25 +0200
  • bc460de171 * Add extra tests Matthew Honnibal 2014-09-25 18:29:42 +0200
  • 93505276ed * Add German tokenizer files Matthew Honnibal 2014-09-25 18:29:13 +0200
  • 2e44fa7179 * Add util.py Matthew Honnibal 2014-09-25 18:26:22 +0200
  • c4cd3bc57a * Add prefix and suffix data files Matthew Honnibal 2014-09-25 18:24:52 +0200
  • 2d4e5ceafd * Remove old docs stuff Matthew Honnibal 2014-09-25 18:24:05 +0200
  • b15619e170 * Use PointerHash instead of locally provided _hashing module Matthew Honnibal 2014-09-25 18:22:52 +0200
  • ed446c67ad * Add typedefs file Matthew Honnibal 2014-09-17 23:10:32 +0200
  • 316a57c4be * Remove own memory classes, which have now been broken out into their own package Matthew Honnibal 2014-09-17 23:10:07 +0200
  • ac522e2553 * Switch from own memory class to cymem, in pip Matthew Honnibal 2014-09-17 23:09:24 +0200
  • 6266cac593 * Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks Matthew Honnibal 2014-09-17 20:02:26 +0200
  • 5a20dfc03e * Add memory management code Matthew Honnibal 2014-09-17 20:02:06 +0200
  • 0152831c89 * Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. Matthew Honnibal 2014-09-16 18:01:46 +0200
  • 143e51ec73 * Refactor tokenization, splitting it into a clearer life-cycle. Matthew Honnibal 2014-09-16 13:16:02 +0200
  • c396581a0b * Fiddle with the way strings are interned in lexeme Matthew Honnibal 2014-09-15 06:34:45 +0200
  • 0bb547ab98 * Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance Matthew Honnibal 2014-09-15 06:33:53 +0200
  • 7959141d36 * Add a few abbreviations, to get tests to pass Matthew Honnibal 2014-09-15 06:32:18 +0200
  • db191361ee * Add new tests for fancier tokenization cases Matthew Honnibal 2014-09-15 06:31:58 +0200
  • 6fc06bfe2f * Hack a hard-cased unit in to get a test to pass Matthew Honnibal 2014-09-15 06:31:35 +0200
  • d235299260 * Few nips and tucks to hash table Matthew Honnibal 2014-09-15 05:03:44 +0200
  • e68a431e5e * Pass only the tokens vector to _tokenize, instead of the whole python object. Matthew Honnibal 2014-09-15 04:01:38 +0200
  • 08cef75ffd * Switch to using a heap-allocated vector in tokens Matthew Honnibal 2014-09-15 03:46:14 +0200
  • f77b7098c0 * Upd Tokens to use vector, with bounds checking. Matthew Honnibal 2014-09-15 03:22:40 +0200
  • 0f6bf2a2ee * Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector. Matthew Honnibal 2014-09-15 02:08:39 +0200
  • 5dcc1a426a * Update tokenization tests for new tokenizer rules Matthew Honnibal 2014-09-15 01:32:51 +0200
  • df24e3708c * Move EnglishTokens stuff to Tokens Matthew Honnibal 2014-09-15 01:31:44 +0200
  • bd08cb09a2 * Remove short-circuiting of initial_size argument for PointerHash Matthew Honnibal 2014-09-15 01:30:49 +0200
  • f3393cf57c * Improve interface for PointerHash Matthew Honnibal 2014-09-13 17:29:58 +0200
  • 45865be37e * Switch hash interface, using void* instead of size_t, to avoid casts. Matthew Honnibal 2014-09-13 17:02:06 +0200
  • 0447279c57 * PointerHash working, efficiency is good. 6-7 mins Matthew Honnibal 2014-09-13 16:43:42 +0200
  • 85d68e8e95 * Replaced cache with own hash table. Similar timing Matthew Honnibal 2014-09-13 03:14:43 +0200
  • c8db76e3e1 * Add initial work on simple hash table Matthew Honnibal 2014-09-13 02:02:41 +0200
  • afdc9b7ac2 * More performance fiddling, particularly moving the specials into the cache, so that we can just lookup the cache in _tokenize Matthew Honnibal 2014-09-13 00:59:34 +0200
  • 7d239df4c8 * Fiddle with declarations, for small efficiency boost Matthew Honnibal 2014-09-13 00:31:53 +0200
  • a8e7cce30f * Efficiency tweaks Matthew Honnibal 2014-09-13 00:14:05 +0200
  • 126a8453a5 * Fix performance issues by implementing a better cache. Add own String struct to help Matthew Honnibal 2014-09-12 23:50:37 +0200
  • 9298e36b36 * Move special tokenization into its own lookup table, away from the cache. Matthew Honnibal 2014-09-12 19:43:14 +0200
  • 985bc68327 * Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation. Matthew Honnibal 2014-09-12 18:00:42 +0200
  • 7eab281194 * Fiddle with token features Matthew Honnibal 2014-09-12 15:49:55 +0200
  • 5aa591106b * Fiddle with token features Matthew Honnibal 2014-09-12 15:49:36 +0200
  • 1533041885 * Update the split_one method, so that it doesn't need to cast back to a Python object Matthew Honnibal 2014-09-12 05:10:59 +0200
  • 4817277d66 * Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery. Matthew Honnibal 2014-09-12 04:29:09 +0200
  • 8b20e9ad97 * Delete unused _split method Matthew Honnibal 2014-09-12 04:03:52 +0200
  • a4863686ec * Changed cache to use a linked-list data structure, to take out Python list code. Taking 6-7 mins for gigaword. Matthew Honnibal 2014-09-12 03:30:50 +0200
  • 51e2006a65 * Increase cache size. Processing now 6-7 mins Matthew Honnibal 2014-09-12 02:52:34 +0200
  • e096f30161 * Tweak signatures and refactor slightly. Processing gigaword taking 8-9 mins. Tests passing, but some sort of memory bug on exit. Matthew Honnibal 2014-09-12 02:43:36 +0200
  • 073ee0de63 * Restore dense_hash_map for cache dictionary. Seems to double efficiency Matthew Honnibal 2014-09-12 02:23:51 +0200
  • 3c928fb5e0 * Switch to 64 bit hashes, for better reliability Matthew Honnibal 2014-09-12 02:04:47 +0200
  • 2389bd1b10 * Improve cache mechanism by including a random element depending on the size of the cache. Matthew Honnibal 2014-09-12 00:18:31 +0200
  • c8f7c8bfde * Moving to storing LexemeC structs internally Matthew Honnibal 2014-09-11 21:54:34 +0200
  • bf9c60c31c * Moving to storing LexemeC structs internally Matthew Honnibal 2014-09-11 21:44:58 +0200
  • 563047e90f * Switch to returning a Tokens object Matthew Honnibal 2014-09-11 21:37:32 +0200
  • 1a3222af4b * Moving tokens to use an array internally, instead of a list of Lexeme objects. Matthew Honnibal 2014-09-11 16:57:08 +0200
  • 5b1c651661 * Only store LexemeC structs in the vocabulary, transforming them to Lexeme objects for output. Moving away from Lexeme objects for Tokens soon. Matthew Honnibal 2014-09-11 12:28:38 +0200
  • b5b31c6b6e * Avoid testing for object identity Matthew Honnibal 2014-09-10 20:58:30 +0200
  • e567713429 * Moving back to lexeme structs Matthew Honnibal 2014-09-10 20:41:47 +0200
  • b488224c09 * Restoring Lexeme-as-struct Matthew Honnibal 2014-09-10 20:41:37 +0200
  • e80d3b9784 * Compile tokens in setup Matthew Honnibal 2014-09-10 19:41:19 +0200
  • 7c09c73a14 * Refactor to use tokens class. Matthew Honnibal 2014-09-10 18:27:44 +0200
  • cf412adba8 * Refactoring to use Tokens object Matthew Honnibal 2014-09-10 18:11:13 +0200
  • b8c4549ffe * Tweak overview docs Matthew Honnibal 2014-09-07 21:29:41 +0200
  • 7dac9b9ccb * Fix setup script Matthew Honnibal 2014-09-01 23:41:59 +0200
  • 5ee4d8c641 * Work on tests for flag features Matthew Honnibal 2014-09-01 23:41:43 +0200
  • 8fbe9b6f97 * Bug fixes to flag features Matthew Honnibal 2014-09-01 23:41:31 +0200
  • bf47429368 * Add tests for non_sparse string transform Matthew Honnibal 2014-09-01 23:27:31 +0200
  • c50433163f * Add tests for flag features Matthew Honnibal 2014-09-01 23:27:09 +0200
  • 786a4a86fe * Add tests for canon_case Matthew Honnibal 2014-09-01 23:26:49 +0200
  • 4c7b997df7 * Add tests for word shape features Matthew Honnibal 2014-09-01 23:26:17 +0200
  • c5abb81f4c * Add incomplete tests of asciify function Matthew Honnibal 2014-09-01 23:25:51 +0200
  • 151aa14bba * Add asciify string transform, and other bits. Matthew Honnibal 2014-09-01 23:25:28 +0200
  • c4ba216642 * Switch canon_case to get value, to avoid keyerror Matthew Honnibal 2014-09-01 17:27:36 +0200
  • a779275a59 * Add canon_case function Matthew Honnibal 2014-08-30 20:57:43 +0200
  • 8bbfadfced * Pass tests. Need to implement more feature functions. Matthew Honnibal 2014-08-30 20:36:06 +0200
  • dcab14ede2 * Begin testing more functionality Matthew Honnibal 2014-08-30 19:01:15 +0200
  • 3e3ff99ca0 * Add orth features Matthew Honnibal 2014-08-30 19:01:00 +0200
  • 6209d94f83 * Add tests for word shape Matthew Honnibal 2014-08-30 19:00:10 +0200
  • 4e5b2d47e2 * More docs Matthew Honnibal 2014-08-29 03:01:40 +0200
  • 5233f110c4 * Adding PTB3 tokenizer back in, so can understand how much boilerplate is in the docs for multiple tokenizers Matthew Honnibal 2014-08-29 02:30:27 +0200
  • 45a22d6b2c * Docs coming together Matthew Honnibal 2014-08-29 01:59:23 +0200
  • c282e6d5fb * Redesign proceeding Matthew Honnibal 2014-08-28 19:45:09 +0200
  • fd4e61e58b * Fixed contraction tests. Need to correct problem with the way case stats and tag stats are supposed to work. Matthew Honnibal 2014-08-27 20:22:33 +0200
  • fdaf24604a * Basic punct tests updated and passing Matthew Honnibal 2014-08-27 19:38:57 +0200
  • 8d20617dfd * Whitespace Matthew Honnibal 2014-08-27 17:16:16 +0200
  • e9a62b6eba * Refactoring with Lexeme as a class now compiles. Basic design seems to work Matthew Honnibal 2014-08-27 17:15:39 +0200
  • 68bae2fec6 * More refactoring Matthew Honnibal 2014-08-25 16:42:22 +0200