19 Commits

Author | SHA1 | Message | Date
Matthew Honnibal | 677e111ee7 | * Revise tokenization rules to match PTB. Rules are pretty messy around periods, need better support for these. | 2014-12-07 22:04:47 +11:00
Matthew Honnibal | da70b6bd60 | * Upd tokenization special-cases | 2014-11-11 22:10:15 +11:00
Matthew Honnibal | bea762ec04 | * Update tokenization rules | 2014-11-04 01:06:00 +11:00
Matthew Honnibal | 75329e9ef8 | * Add Co. abbreviation to tokenization rules | 2014-11-03 00:16:20 +11:00
Matthew Honnibal | fa91506073 | * Add '' double quote to suffixes file | 2014-11-03 00:12:59 +11:00
Matthew Honnibal | 11e42fd070 | * Add emoticons to tokenization | 2014-11-01 15:14:55 +11:00
Matthew Honnibal | 39743323ea | * Add i'ma to tokenization rules | 2014-10-31 17:45:44 +11:00
Matthew Honnibal | 849de654e7 | * Add file for infix patterns | 2014-10-14 20:26:43 +11:00
Matthew Honnibal | 5abb194553 | * Add semi-colon to suffix punct | 2014-10-14 10:43:45 +11:00
Matthew Honnibal | c4cd3bc57a | * Add prefix and suffix data files | 2014-09-25 18:24:52 +02:00
Matthew Honnibal | 143e51ec73 | * Refactor tokenization, splitting it into a clearer life-cycle. | 2014-09-16 13:16:02 +02:00
Matthew Honnibal | 6fc06bfe2f | * Hack a hard-cased unit in to get a test to pass | 2014-09-15 06:31:35 +02:00
Matthew Honnibal | 3b793cf4f7 | * Tests passing for new Word object version | 2014-08-24 18:13:53 +02:00
Matthew Honnibal | a22101404a | * Move en_ptb data | 2014-08-22 04:28:51 +02:00
Matthew Honnibal | a2047fa5aa | * Add 's suffix to tokenization table | 2014-08-18 23:21:37 +02:00
Matthew Honnibal | cc3971ce5c | * Fix error in tokenization rules | 2014-07-07 05:09:34 +02:00
Matthew Honnibal | 997551241f | * Upd ptb tokenization rules | 2014-07-07 05:09:22 +02:00
Matthew Honnibal | df0458001d | * Begin work on full PTB-compatible English tokenization | 2014-07-07 04:29:24 +02:00
Matthew Honnibal | d5bef02c72 | * Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals | 2014-07-07 04:21:06 +02:00