Dan Rapp
|
3b1df3808d
|
Issue #840 - URL pattern too broad
|
2017-03-09 11:39:39 -07:00 |
|
Ines Montani
|
012f4820cb
|
Keep infixes of punctuation + hyphens as one token (see #801)
|
2017-02-02 16:22:40 +01:00 |
|
Ines Montani
|
1219a5f513
|
Add = to tokenizer prefixes
|
2017-02-02 16:21:11 +01:00 |
|
Ines Montani
|
ff04748eb6
|
Add missing emoticon
|
2017-02-02 16:21:00 +01:00 |
|
Ines Montani
|
116c675c3c
|
Merge pull request #742 from oroszgy/hu_tokenizer_fix
Improved Hungarian tokenizer
|
2017-01-14 23:52:44 +01:00 |
|
Gyorgy Orosz
|
63037e79af
|
Fixed hyphen handling in the Hungarian tokenizer.
|
2017-01-14 16:30:11 +01:00 |
|
Gyorgy Orosz
|
be7a7aeb1a
|
Reversed accidental changes.
|
2017-01-14 15:59:36 +01:00 |
|
Gyorgy Orosz
|
1be5da1ac6
|
Fixed Hungarian tokenizer for numbers
|
2017-01-14 15:51:59 +01:00 |
|
Ines Montani
|
0894b8c0ef
|
Don't split tokens with digits and "/" infixes (resolves #740)
|
2017-01-12 22:58:26 +01:00 |
|
Matthew Honnibal
|
fba67fa342
|
Fix Issue #736: Times were being tokenized with incorrect string values.
|
2017-01-12 11:21:01 +01:00 |
|
Ines Montani
|
aa876884f0
|
Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
|
2017-01-09 13:28:13 +01:00 |
|
Ines Montani
|
eef94e3ee2
|
Split off period after two or more uppercase letters (fixes #483)
|
2017-01-08 22:28:25 +01:00 |
|
Ines Montani
|
347c4a2d06
|
Reorganise and reformat global tokenizer prefixes, suffixes and infixes
|
2017-01-08 20:37:39 +01:00 |
|
Ines Montani
|
7c3cb2a652
|
Add global abbreviations data
|
2017-01-08 20:34:03 +01:00 |
|
Ines Montani
|
bc911322b3
|
Move ") to emoticons (see Tweebo challenge test)
|
2017-01-05 18:05:38 +01:00 |
|
Ines Montani
|
fb9d3bb022
|
Revert "Merge remote-tracking branch 'origin/master'"
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
|
2017-01-03 18:21:36 +01:00 |
|
Matthew Honnibal
|
9936a1b9b5
|
Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns
|
2016-12-30 14:53:40 -06:00 |
|
Petter Hohle
|
f112e7754e
|
Add PART to tag map
16 of the 17 PoS tags in the UD tag set are added; PART is missing.
|
2016-12-28 18:39:01 +01:00 |
|
Gyorgy Orosz
|
3a9be4d485
|
Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.
|
2016-12-23 23:49:34 +01:00 |
|
Gyorgy Orosz
|
1748549aeb
|
Added exception pattern mechanism to the tokenizer.
|
2016-12-21 23:16:19 +01:00 |
|
Ines Montani
|
920fa0fed2
|
Add DET_LEMMA constant
|
2016-12-21 18:05:41 +01:00 |
|
Ines Montani
|
4e95737c6c
|
Add base tag map
|
2016-12-18 16:54:28 +01:00 |
|
Ines Montani
|
2b2ea8ca11
|
Reorganise language data
|
2016-12-18 16:54:19 +01:00 |
|
Ines Montani
|
bc40dad7d9
|
Add entity rules
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
eaa3b1319d
|
Fix formatting
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
62655fd36f
|
Add ENT_ID constant
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
f324311249
|
Add global language data utils
|
2016-12-17 12:27:41 +01:00 |
|
Ines Montani
|
e47ee94761
|
Split punctuation into its own file
|
2016-12-08 19:46:43 +01:00 |
|
Ines Montani
|
e8ae588be9
|
Add emoticons
|
2016-12-08 19:45:18 +01:00 |
|
Ines Montani
|
5908c0ed9f
|
Fix formatting
|
2016-12-08 19:45:11 +01:00 |
|
Ines Montani
|
0d07d7fc80
|
Apply emoticon exceptions to tokenizer
|
2016-12-07 21:11:59 +01:00 |
|
Ines Montani
|
9413bcd9ee
|
Declare encoding and unicode literals
|
2016-12-07 21:10:34 +01:00 |
|
Ines Montani
|
a280ff2657
|
Fix __all__
|
2016-12-07 21:10:12 +01:00 |
|
Ines Montani
|
ba8721953c
|
Add missing emoticons
|
2016-12-07 21:09:44 +01:00 |
|
Ines Montani
|
79dce0aabe
|
Add emoticons
|
2016-12-07 20:33:28 +01:00 |
|