Matthew Honnibal | b962fe73d7 | * Make suffixes file use full-power regex, so that we can handle periods properly | 2014-12-09 19:04:27 +11:00
Matthew Honnibal | 1ccabc806e | * Work on lemmatization | 2014-12-09 16:06:18 +11:00
Matthew Honnibal | 677e111ee7 | * Revise tokenization rules to match PTB. Rules are pretty messy around periods, need better support for these. | 2014-12-07 22:04:47 +11:00
Matthew Honnibal | da70b6bd60 | * Upd tokenization special-cases | 2014-11-11 22:10:15 +11:00
Matthew Honnibal | bea762ec04 | * Update tokenization rules | 2014-11-04 01:06:00 +11:00
Matthew Honnibal | 75329e9ef8 | * Add Co. abbreviation to tokenization rules | 2014-11-03 00:16:20 +11:00
Matthew Honnibal | fa91506073 | * Add '' double quote to suffixes file | 2014-11-03 00:12:59 +11:00
Matthew Honnibal | 11e42fd070 | * Add emoticons to tokenization | 2014-11-01 15:14:55 +11:00
Matthew Honnibal | 39743323ea | * Add i'ma to tokenization rules | 2014-10-31 17:45:44 +11:00
Matthew Honnibal | 849de654e7 | * Add file for infix patterns | 2014-10-14 20:26:43 +11:00
Matthew Honnibal | 5abb194553 | * Add semi-colon to suffix punct | 2014-10-14 10:43:45 +11:00
Matthew Honnibal | c4cd3bc57a | * Add prefix and suffix data files | 2014-09-25 18:24:52 +02:00
Matthew Honnibal | 143e51ec73 | * Refactor tokenization, splitting it into a clearer life-cycle. | 2014-09-16 13:16:02 +02:00
Matthew Honnibal | 6fc06bfe2f | * Hack a hard-cased unit in to get a test to pass | 2014-09-15 06:31:35 +02:00
Matthew Honnibal | 3b793cf4f7 | * Tests passing for new Word object version | 2014-08-24 18:13:53 +02:00
Matthew Honnibal | a22101404a | * Move en_ptb data | 2014-08-22 04:28:51 +02:00
Matthew Honnibal | a2047fa5aa | * Add 's suffix to tokenization table | 2014-08-18 23:21:37 +02:00
Matthew Honnibal | cc3971ce5c | * Fix error in tokenization rules | 2014-07-07 05:09:34 +02:00
Matthew Honnibal | 997551241f | * Upd ptb tokenization rules | 2014-07-07 05:09:22 +02:00
Matthew Honnibal | df0458001d | * Begin work on full PTB-compatible English tokenization | 2014-07-07 04:29:24 +02:00
Matthew Honnibal | d5bef02c72 | * Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals | 2014-07-07 04:21:06 +02:00