Commit Graph

  • 0f6bf2a2ee * Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector. Matthew Honnibal 2014-09-15 02:08:39 +0200
  • 5dcc1a426a * Update tokenization tests for new tokenizer rules Matthew Honnibal 2014-09-15 01:32:51 +0200
  • df24e3708c * Move EnglishTokens stuff to Tokens Matthew Honnibal 2014-09-15 01:31:44 +0200
  • bd08cb09a2 * Remove short-circuiting of initial_size argument for PointerHash Matthew Honnibal 2014-09-15 01:30:49 +0200
  • f3393cf57c * Improve interface for PointerHash Matthew Honnibal 2014-09-13 17:29:58 +0200
  • 45865be37e * Switch hash interface, using void* instead of size_t, to avoid casts. Matthew Honnibal 2014-09-13 17:02:06 +0200
  • 0447279c57 * PointerHash working, efficiency is good. 6-7 mins Matthew Honnibal 2014-09-13 16:43:42 +0200
  • 85d68e8e95 * Replaced cache with own hash table. Similar timing Matthew Honnibal 2014-09-13 03:14:43 +0200
  • c8db76e3e1 * Add initial work on simple hash table Matthew Honnibal 2014-09-13 02:02:41 +0200
  • afdc9b7ac2 * More performance fiddling, particularly moving the specials into the cache, so that we can just lookup the cache in _tokenize Matthew Honnibal 2014-09-13 00:59:34 +0200
  • 7d239df4c8 * Fiddle with declarations, for small efficiency boost Matthew Honnibal 2014-09-13 00:31:53 +0200
  • a8e7cce30f * Efficiency tweaks Matthew Honnibal 2014-09-13 00:14:05 +0200
  • 126a8453a5 * Fix performance issues by implementing a better cache. Add own String struct to help Matthew Honnibal 2014-09-12 23:50:37 +0200
  • 9298e36b36 * Move special tokenization into its own lookup table, away from the cache. Matthew Honnibal 2014-09-12 19:43:14 +0200
  • 985bc68327 * Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation. Matthew Honnibal 2014-09-12 18:00:42 +0200
  • 7eab281194 * Fiddle with token features Matthew Honnibal 2014-09-12 15:49:55 +0200
  • 5aa591106b * Fiddle with token features Matthew Honnibal 2014-09-12 15:49:36 +0200
  • 1533041885 * Update the split_one method, so that it doesn't need to cast back to a Python object Matthew Honnibal 2014-09-12 05:10:59 +0200
  • 4817277d66 * Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery. Matthew Honnibal 2014-09-12 04:29:09 +0200
  • 8b20e9ad97 * Delete unused _split method Matthew Honnibal 2014-09-12 04:03:52 +0200
  • a4863686ec * Changed cache to use a linked-list data structure, to take out Python list code. Taking 6-7 mins for gigaword. Matthew Honnibal 2014-09-12 03:30:50 +0200
  • 51e2006a65 * Increase cache size. Processing now 6-7 mins Matthew Honnibal 2014-09-12 02:52:34 +0200
  • e096f30161 * Tweak signatures and refactor slightly. Processing gigaword taking 8-9 mins. Tests passing, but some sort of memory bug on exit. Matthew Honnibal 2014-09-12 02:43:36 +0200
  • 073ee0de63 * Restore dense_hash_map for cache dictionary. Seems to double efficiency Matthew Honnibal 2014-09-12 02:23:51 +0200
  • 3c928fb5e0 * Switch to 64 bit hashes, for better reliability Matthew Honnibal 2014-09-12 02:04:47 +0200
  • 2389bd1b10 * Improve cache mechanism by including a random element depending on the size of the cache. Matthew Honnibal 2014-09-12 00:18:31 +0200
  • c8f7c8bfde * Moving to storing LexemeC structs internally Matthew Honnibal 2014-09-11 21:54:34 +0200
  • bf9c60c31c * Moving to storing LexemeC structs internally Matthew Honnibal 2014-09-11 21:44:58 +0200
  • 563047e90f * Switch to returning a Tokens object Matthew Honnibal 2014-09-11 21:37:32 +0200
  • 1a3222af4b * Moving tokens to use an array internally, instead of a list of Lexeme objects. Matthew Honnibal 2014-09-11 16:57:08 +0200
  • 5b1c651661 * Only store LexemeC structs in the vocabulary, transforming them to Lexeme objects for output. Moving away from Lexeme objects for Tokens soon. Matthew Honnibal 2014-09-11 12:28:38 +0200
  • b5b31c6b6e * Avoid testing for object identity Matthew Honnibal 2014-09-10 20:58:30 +0200
  • e567713429 * Moving back to lexeme structs Matthew Honnibal 2014-09-10 20:41:47 +0200
  • b488224c09 * Restoring Lexeme-as-struct Matthew Honnibal 2014-09-10 20:41:37 +0200
  • e80d3b9784 * Compile tokens in setup Matthew Honnibal 2014-09-10 19:41:19 +0200
  • 7c09c73a14 * Refactor to use tokens class. Matthew Honnibal 2014-09-10 18:27:44 +0200
  • cf412adba8 * Refactoring to use Tokens object Matthew Honnibal 2014-09-10 18:11:13 +0200
  • b8c4549ffe * Tweak overview docs Matthew Honnibal 2014-09-07 21:29:41 +0200
  • 7dac9b9ccb * Fix setup script Matthew Honnibal 2014-09-01 23:41:59 +0200
  • 5ee4d8c641 * Work on tests for flag features Matthew Honnibal 2014-09-01 23:41:43 +0200
  • 8fbe9b6f97 * Bug fixes to flag features Matthew Honnibal 2014-09-01 23:41:31 +0200
  • bf47429368 * Add tests for non_sparse string transform Matthew Honnibal 2014-09-01 23:27:31 +0200
  • c50433163f * Add tests for flag features Matthew Honnibal 2014-09-01 23:27:09 +0200
  • 786a4a86fe * Add tests for canon_case Matthew Honnibal 2014-09-01 23:26:49 +0200
  • 4c7b997df7 * Add tests for word shape features Matthew Honnibal 2014-09-01 23:26:17 +0200
  • c5abb81f4c * Add incomplete tests of asciify function Matthew Honnibal 2014-09-01 23:25:51 +0200
  • 151aa14bba * Add asciify string transform, and other bits. Matthew Honnibal 2014-09-01 23:25:28 +0200
  • c4ba216642 * Switch canon_case to get value, to avoid keyerror Matthew Honnibal 2014-09-01 17:27:36 +0200
  • a779275a59 * Add canon_case function Matthew Honnibal 2014-08-30 20:57:43 +0200
  • 8bbfadfced * Pass tests. Need to implement more feature functions. Matthew Honnibal 2014-08-30 20:36:06 +0200
  • dcab14ede2 * Begin testing more functionality Matthew Honnibal 2014-08-30 19:01:15 +0200
  • 3e3ff99ca0 * Add orth features Matthew Honnibal 2014-08-30 19:01:00 +0200
  • 6209d94f83 * Add tests for word shape Matthew Honnibal 2014-08-30 19:00:10 +0200
  • 4e5b2d47e2 * More docs Matthew Honnibal 2014-08-29 03:01:40 +0200
  • 5233f110c4 * Adding PTB3 tokenizer back in, so can understand how much boilerplate is in the docs for multiple tokenizers Matthew Honnibal 2014-08-29 02:30:27 +0200
  • 45a22d6b2c * Docs coming together Matthew Honnibal 2014-08-29 01:59:23 +0200
  • c282e6d5fb * Redesign proceeding Matthew Honnibal 2014-08-28 19:45:09 +0200
  • fd4e61e58b * Fixed contraction tests. Need to correct problem with the way case stats and tag stats are supposed to work. Matthew Honnibal 2014-08-27 20:22:33 +0200
  • fdaf24604a * Basic punct tests updated and passing Matthew Honnibal 2014-08-27 19:38:57 +0200
  • 8d20617dfd * Whitespace Matthew Honnibal 2014-08-27 17:16:16 +0200
  • e9a62b6eba * Refactoring with Lexeme as a class now compiles. Basic design seems to work Matthew Honnibal 2014-08-27 17:15:39 +0200
  • 68bae2fec6 * More refactoring Matthew Honnibal 2014-08-25 16:42:22 +0200
  • 88095666dc * Remove Lexeme struct, preparing to rename Word to Lexeme. Matthew Honnibal 2014-08-24 19:24:42 +0200
  • ce59526011 * Add Word classes Matthew Honnibal 2014-08-24 18:14:08 +0200
  • 3b793cf4f7 * Tests passing for new Word object version Matthew Honnibal 2014-08-24 18:13:53 +0200
  • 9815c7649e * Refactor around Word objects, adapting tests. Tests passing, except for string views. Matthew Honnibal 2014-08-23 19:55:06 +0200
  • 4f01df9152 * Moving to Word objects in place of the Lexeme struct. Matthew Honnibal 2014-08-22 17:32:16 +0200
  • 782806df08 * Moving to Word objects in place of the Lexeme struct. Matthew Honnibal 2014-08-22 17:28:23 +0200
  • 47fbd0475a * Replace the use of dense_hash_map with Python dict Matthew Honnibal 2014-08-22 17:13:09 +0200
  • 6f83dca218 * Fix import for ptb tokenization test Matthew Honnibal 2014-08-22 17:05:44 +0200
  • e289896603 * Fix ptb3 module Matthew Honnibal 2014-08-22 16:35:48 +0200
  • a22101404a * Move en_ptb data Matthew Honnibal 2014-08-22 04:28:51 +0200
  • 89d6faa9c9 * Move en_ptb to ptb3 Matthew Honnibal 2014-08-22 04:23:24 +0200
  • 4bcdd6d31c * Further improvements to spacy docs, tweaks to code. Matthew Honnibal 2014-08-22 04:20:24 +0200
  • 4eb9c2b30f * Add overview doc Matthew Honnibal 2014-08-22 03:38:05 +0200
  • 07ecf5d2f4 * Fixed group_by, removed idea of general attr_of function. Matthew Honnibal 2014-08-22 00:02:37 +0200
  • 811b7a6b91 * Struggling with arbitrary attr access... Matthew Honnibal 2014-08-21 23:49:14 +0200
  • 314658b31c * Improve module docstring Matthew Honnibal 2014-08-21 18:42:47 +0200
  • 8bcd07dbae * More docs work Matthew Honnibal 2014-08-21 17:05:28 +0200
  • d10993f41a * More docs work Matthew Honnibal 2014-08-21 16:37:13 +0200
  • d5403a6fe3 * More docs work Matthew Honnibal 2014-08-21 16:37:06 +0200
  • 248cbb6d07 * Update doc strings Matthew Honnibal 2014-08-21 03:29:15 +0200
  • cbda38e2d9 * Improving docs Matthew Honnibal 2014-08-20 21:09:39 +0200
  • cab7f63fc2 * Temporarily remove sparsehash requirement Matthew Honnibal 2014-08-20 17:12:19 +0200
  • c289867d1f * Add draft sphinx docs files Matthew Honnibal 2014-08-20 17:05:18 +0200
  • bebfd7940d * Upd gitignore Matthew Honnibal 2014-08-20 17:04:33 +0200
  • 76afbd7d69 * Remove compiled orthography file Matthew Honnibal 2014-08-20 17:04:07 +0200
  • f39dcb1d89 * Add orthography Matthew Honnibal 2014-08-20 17:03:44 +0200
  • d42cdbb446 * Compile orthography.latin.pyx Matthew Honnibal 2014-08-20 17:03:19 +0200
  • 0c4c47b074 * Add docs requirements Matthew Honnibal 2014-08-20 17:02:54 +0200
  • bd51742fbd * Remove MurmurHash headers Matthew Honnibal 2014-08-20 17:02:47 +0200
  • 416a324bcf * Add documentation building to fabfile Matthew Honnibal 2014-08-20 17:02:32 +0200
  • a78ad4152d * Broken version being refactored for docs Matthew Honnibal 2014-08-20 13:39:39 +0200
  • 5fddb8d165 * Working refactor, with updated data model for Lexemes Matthew Honnibal 2014-08-19 04:21:20 +0200
  • 3379d7a571 * Reforming data model for lexemes Matthew Honnibal 2014-08-19 02:40:37 +0200
  • e091b6a241 * Base master on temp branch Matthew Honnibal 2014-08-18 23:29:21 +0200
  • 85df22c379 * Remove murmurhash from requirements Matthew Honnibal 2014-08-18 23:26:20 +0200
  • ab9b0daabf * Whitespace Matthew Honnibal 2014-08-18 23:21:49 +0200
  • a2047fa5aa * Add 's suffix to tokenization table Matthew Honnibal 2014-08-18 23:21:37 +0200
  • 1b71cbfe28 * Roll back to using unicode, and never Py_UNICODE. No dependence on murmurhash either. Matthew Honnibal 2014-08-18 20:48:48 +0200