* Update tokenization rules

Matthew Honnibal 2014-11-04 01:06:00 +11:00
parent b8d5881333
commit bea762ec04
2 changed files with 96 additions and 1 deletions

@@ -1,2 +1 @@
-(?<=[^-])-(?=\w)
 (?<=[a-z])\.(?=[A-Z])

@@ -9,6 +9,7 @@
 ain't are not
 aren't are not
 can't can not
+cannot can not
 could've could have
 couldn't could not
 couldn't've could not have
@@ -94,13 +95,108 @@ you'd've you would have
 you'll you will
 you're you are
 you've you have
+'em them
+'ol old
 10km 10 km
 U.S. U.S.
+non-U.S. non-U.S.
 U.N. U.N.
 Co. Co.
+Corp. Corp.
+Inc. Inc.
+Rep. Rep.
 Ms. Ms.
 Mr. Mr.
+a.m. a.m.
+p.m. p.m.
+Nos. Nos.
+a.k.a. a.k.a.
+A. A.
+B. B.
+C. C.
+D. D.
+E. E.
+F. F.
+G. G.
+H. H.
+J. J.
+K. K.
+L. L.
+M. M.
+N. N.
+O. O.
 P. P.
+Q. Q.
+R. R.
+S. S.
+T. T.
+U. U.
+V. V.
+W. W.
+X. X.
+Y. Y.
+Z. Z.
+Jan. Jan.
+Feb. Feb.
+Mar. Mar.
+Apr. Apr.
+May. May.
+Jun. Jun.
+Jul. Jul.
+Aug. Aug.
+Sep. Sep.
+Sept. Sept.
+Oct. Oct.
+Nov. Nov.
+Dec. Dec.
+N.V. N.V.
+Ala. Ala.
+Ariz. Ariz.
+Ark. Ark.
+Calif. Calif.
+Colo. Colo.
+Conn. Conn.
+Del. Del.
+D.C. D.C.
+Fla. Fla.
+Ga. Ga.
+Ill. Ill.
+Ind. Ind.
+Kans. Kans.
+Kan. Kan.
+Ky. Ky.
+La. La.
+Md. Md.
+Mass. Mass.
+Mich. Mich.
+Minn. Minn.
+Miss. Miss.
+Mo. Mo.
+Mont. Mont.
+Nebr. Nebr.
+Nev. Nev.
+N.H. N.H.
+N.J. N.J.
+N.M. N.M.
+N.Y. N.Y.
+N.C. N.C.
+N.D. N.D.
+Okla. Okla.
+Ore. Ore.
+Pa. Pa.
+P.R. P.R.
+R.I. R.I.
+S.C. S.C.
+S.D. S.D.
+Tenn. Tenn.
+Tex. Tex.
+Vt. Vt.
+Va. Va.
+V.I. V.I.
+Wash. Wash.
+W.Va. W.Va.
+Wis. Wis.
+Wyo. Wyo.
 '' ''
 :) :)
 <3 <3