Ines Montani
|
363f09e68c
|
Merge pull request #726 from magnusburton/master
Added Swedish abbreviations as token exceptions
|
2017-01-09 14:58:15 +01:00 |
|
Matthew Honnibal
|
42cd598f57
|
Use correct fixtures in URL tokenizer
|
2017-01-09 14:10:40 +01:00 |
|
Matthew Honnibal
|
d9a77ddf14
|
Return None for data path if it doesn't exist
|
2017-01-09 14:10:05 +01:00 |
|
Matthew Honnibal
|
e4862d1dab
|
Merge branch 'develop'
|
2017-01-09 13:36:01 +01:00 |
|
Ines Montani
|
aa876884f0
|
Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
|
2017-01-09 13:28:13 +01:00 |
|
Ines Montani
|
d5c72c40eb
|
Remove old tests for old website example code
|
2017-01-08 22:28:53 +01:00 |
|
Ines Montani
|
eef94e3ee2
|
Split off period after two or more uppercase letters (fixes #483)
|
2017-01-08 22:28:25 +01:00 |
|
Ines Montani
|
a89a6000e5
|
Remove unused import
|
2017-01-08 22:17:37 +01:00 |
|
Ines Montani
|
5d28664fc5
|
Don't test Hungarian for numbers and hyphens for now
Reinvestigate behaviour of case affixes given reorganised tokenizer patterns.
|
2017-01-08 20:45:40 +01:00 |
|
Ines Montani
|
53362b6b93
|
Reorganise Hungarian prefixes/suffixes/infixes
Use global prefixes and suffixes for non-language-specific rules,
import list of alpha unicode characters and adjust regexes.
|
2017-01-08 20:40:33 +01:00 |
|
Ines Montani
|
347c4a2d06
|
Reorganise and reformat global tokenizer prefixes, suffixes and infixes
|
2017-01-08 20:37:39 +01:00 |
|
Ines Montani
|
0dec90e9f7
|
Use global abbreviation data languages and remove duplicates
|
2017-01-08 20:36:00 +01:00 |
|
Ines Montani
|
7c3cb2a652
|
Add global abbreviations data
|
2017-01-08 20:34:03 +01:00 |
|
Ines Montani
|
de5aa92bc2
|
Handle deprecated tokenizer prefix data
|
2017-01-08 20:33:28 +01:00 |
|
Ines Montani
|
abb09782f9
|
Move sun.txt to original location and fix path to not break parser tests
|
2017-01-08 20:32:54 +01:00 |
|
Ines Montani
|
cab39c59c5
|
Add missing contractions to English tokenizer exceptions
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
|
2017-01-05 19:59:06 +01:00 |
|
Ines Montani
|
a23504fe07
|
Move abbreviations below other exceptions
|
2017-01-05 19:58:07 +01:00 |
|
Ines Montani
|
7d2cf934b9
|
Generate he/she/it correctly with 's instead of 've
|
2017-01-05 19:57:00 +01:00 |
|
Ines Montani
|
8328925e1f
|
Add newlines to long German text
|
2017-01-05 18:13:30 +01:00 |
|
Ines Montani
|
55b46d7cf6
|
Add tokenizer tests for German
|
2017-01-05 18:11:25 +01:00 |
|
Ines Montani
|
5bb4081f52
|
Remove redundant test_tokenizer.py for English
|
2017-01-05 18:11:11 +01:00 |
|
Ines Montani
|
8216ba599b
|
Add tests for longer and mixed English texts
|
2017-01-05 18:11:04 +01:00 |
|
Ines Montani
|
65f937d5c6
|
Move basic contraction tests to test_contractions.py
|
2017-01-05 18:09:53 +01:00 |
|
Ines Montani
|
bbe7cab3a1
|
Move non-English-specific tests back to general tokenizer tests
|
2017-01-05 18:09:29 +01:00 |
|
Ines Montani
|
038002d616
|
Reformat HU tokenizer tests and adapt to general style
Improve readability of test cases and add conftest.py with fixture
|
2017-01-05 18:06:44 +01:00 |
|
Ines Montani
|
bc911322b3
|
Move ") to emoticons (see Tweebo challenge test)
|
2017-01-05 18:05:38 +01:00 |
|
Ines Montani
|
637f785036
|
Add general sanity tests for all tokenizers
|
2017-01-05 16:25:38 +01:00 |
|
Ines Montani
|
c5f2dc15de
|
Move English tokenizer tests to directory /en
|
2017-01-05 16:25:04 +01:00 |
|
Ines Montani
|
8b45363b4d
|
Modernize and merge general tokenizer tests
|
2017-01-05 13:17:05 +01:00 |
|
Ines Montani
|
02cfda48c9
|
Modernize and merge tokenizer tests for string loading
|
2017-01-05 13:16:55 +01:00 |
|
Ines Montani
|
a11f684822
|
Modernize and merge tokenizer tests for whitespace
|
2017-01-05 13:16:33 +01:00 |
|
Ines Montani
|
8b284fc6f1
|
Modernize and merge tokenizer tests for text from file
|
2017-01-05 13:15:52 +01:00 |
|
Ines Montani
|
2c2e878653
|
Modernize and merge tokenizer tests for punctuation
|
2017-01-05 13:14:16 +01:00 |
|
Ines Montani
|
8a74129cdf
|
Modernize and merge tokenizer tests for prefixes/suffixes/infixes
|
2017-01-05 13:13:12 +01:00 |
|
Ines Montani
|
0e65dca9a5
|
Modernize and merge tokenizer tests for exception and emoticons
|
2017-01-05 13:11:31 +01:00 |
|
Ines Montani
|
34c47bb20d
|
Fix formatting
|
2017-01-05 13:10:51 +01:00 |
|
Ines Montani
|
2e72683baa
|
Add missing docstrings
|
2017-01-05 13:10:21 +01:00 |
|
Ines Montani
|
da10a049a6
|
Add unicode declarations
|
2017-01-05 13:09:48 +01:00 |
|
Ines Montani
|
58adae8774
|
Remove unused file
|
2017-01-05 13:09:22 +01:00 |
|
Ines Montani
|
c6e5a5349d
|
Move regression test for #360 into own file
|
2017-01-04 00:49:31 +01:00 |
|
Ines Montani
|
8279993a6f
|
Modernize and merge tokenizer tests for punctuation
|
2017-01-04 00:49:20 +01:00 |
|
Ines Montani
|
550630df73
|
Update tokenizer tests for contractions
|
2017-01-04 00:48:42 +01:00 |
|
Ines Montani
|
109f202e8f
|
Update conftest fixture
|
2017-01-04 00:48:21 +01:00 |
|
Ines Montani
|
ee6b49b293
|
Modernize tokenizer tests for emoticons
|
2017-01-04 00:47:59 +01:00 |
|
Ines Montani
|
f09b5a5dfd
|
Modernize tokenizer tests for infixes
|
2017-01-04 00:47:42 +01:00 |
|
Ines Montani
|
59059fed27
|
Move regression test for #351 to own file
|
2017-01-04 00:47:11 +01:00 |
|
Ines Montani
|
667051375d
|
Modernize tokenizer tests for whitespace
|
2017-01-04 00:46:35 +01:00 |
|
Ines Montani
|
aafc894285
|
Modernize tokenizer tests for contractions
Use @pytest.mark.parametrize.
|
2017-01-03 23:02:21 +01:00 |
|
Ines Montani
|
1d237664af
|
Add lowercase lemma to tokenizer exceptions
|
2017-01-03 23:02:21 +01:00 |
|
Ines Montani
|
84a87951eb
|
Fix typos
|
2017-01-03 18:27:43 +01:00 |
|