Raphaël Bournhonesque
|
3452d6ce52
|
Resolve issue #1078 by simplifying URL pattern
- avoid catastrophic backtracking
- reduce character range of host name, domain name and TLD identifier
|
2017-10-11 11:24:00 +02:00 |
|
Matthew Honnibal
|
45029a550e
|
Fix customized-tokenizer tests
|
2017-09-04 20:13:13 +02:00 |
|
Matthew Honnibal
|
c68f188eb0
|
Fix error on test
|
2017-09-04 18:59:36 +02:00 |
|
Matthew Honnibal
|
9bffcaa73d
|
Update test to make it slightly more direct
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
|
2017-09-01 21:16:56 +02:00 |
|
Vimos Tan
|
a6d9fb5bb6
|
fix issue #1292
|
2017-08-30 14:49:14 +08:00 |
|
ines
|
0084466a66
|
Remove unused utf8open util and replace os.path with ensure_path
|
2017-04-16 20:37:45 +02:00 |
|
ines
|
444dd511c5
|
Fix xpassing URL test case
|
2017-04-07 17:36:05 +02:00 |
|
ines
|
10e29189ac
|
Adjust URL testcases and xfail problems (instead of comment)
|
2017-03-10 14:22:50 +01:00 |
|
Dan Rapp
|
123d3f2d38
|
Fix error in test case parameterization
|
2017-03-09 12:18:21 -07:00 |
|
Dan Rapp
|
b9307dfcd7
|
Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix
|
2017-03-09 11:42:14 -07:00 |
|
Dan Rapp
|
3b1df3808d
|
Issue #840 - URL pattern too broad
|
2017-03-09 11:39:39 -07:00 |
|
Aniruddha Adhikary
|
696215a3fb
|
add tests for Bengali
|
2017-03-05 11:25:12 +06:00 |
|
Ines Montani
|
138c53ff2e
|
Merge tokenizer tests
|
2017-01-13 01:34:14 +01:00 |
|
Ines Montani
|
33e5f8dc2e
|
Create basic and extended test set for URLs
|
2017-01-12 23:40:02 +01:00 |
|
Ines Montani
|
ae7edd30e7
|
Move text file back to tokenizer tests directory
|
2017-01-12 02:10:23 +01:00 |
|
Ines Montani
|
c682b8ca90
|
Merge conftests into one cohesive file
|
2017-01-11 13:56:32 +01:00 |
|
Ines Montani
|
869963c3c4
|
Mark extensive prefix/suffix tests as slow
|
2017-01-10 15:57:35 +01:00 |
|
Ines Montani
|
487e020ebe
|
Add simple test for surrounding brackets
|
2017-01-10 15:57:26 +01:00 |
|
Ines Montani
|
0ba5cf51d2
|
Assert length first
|
2017-01-10 15:57:00 +01:00 |
|
Ines Montani
|
2185d31907
|
Adjust names and formatting
|
2017-01-10 15:56:35 +01:00 |
|
Ines Montani
|
e10d4ca964
|
Remove semi-redundant URLs and punctuation for faster testing
|
2017-01-10 15:54:25 +01:00 |
|
Ines Montani
|
3a3cb2c90c
|
Add unicode declaration
|
2017-01-10 15:53:15 +01:00 |
|
Matthew Honnibal
|
42cd598f57
|
Use correct fixtures in URL tokenizer
|
2017-01-09 14:10:40 +01:00 |
|
Ines Montani
|
aa876884f0
|
Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
|
2017-01-09 13:28:13 +01:00 |
|
Ines Montani
|
abb09782f9
|
Move sun.txt to original location and fix path to not break parser tests
|
2017-01-08 20:32:54 +01:00 |
|
Ines Montani
|
bbe7cab3a1
|
Move non-English-specific tests back to general tokenizer tests
|
2017-01-05 18:09:29 +01:00 |
|
Ines Montani
|
637f785036
|
Add general sanity tests for all tokenizers
|
2017-01-05 16:25:38 +01:00 |
|
Ines Montani
|
c5f2dc15de
|
Move English tokenizer tests to directory /en
|
2017-01-05 16:25:04 +01:00 |
|
Ines Montani
|
8b45363b4d
|
Modernize and merge general tokenizer tests
|
2017-01-05 13:17:05 +01:00 |
|
Ines Montani
|
02cfda48c9
|
Modernize and merge tokenizer tests for string loading
|
2017-01-05 13:16:55 +01:00 |
|
Ines Montani
|
a11f684822
|
Modernize and merge tokenizer tests for whitespace
|
2017-01-05 13:16:33 +01:00 |
|
Ines Montani
|
8b284fc6f1
|
Modernize and merge tokenizer tests for text from file
|
2017-01-05 13:15:52 +01:00 |
|
Ines Montani
|
2c2e878653
|
Modernize and merge tokenizer tests for punctuation
|
2017-01-05 13:14:16 +01:00 |
|
Ines Montani
|
8a74129cdf
|
Modernize and merge tokenizer tests for prefixes/suffixes/infixes
|
2017-01-05 13:13:12 +01:00 |
|
Ines Montani
|
0e65dca9a5
|
Modernize and merge tokenizer tests for exception and emoticons
|
2017-01-05 13:11:31 +01:00 |
|
Ines Montani
|
34c47bb20d
|
Fix formatting
|
2017-01-05 13:10:51 +01:00 |
|
Ines Montani
|
2e72683baa
|
Add missing docstrings
|
2017-01-05 13:10:21 +01:00 |
|
Ines Montani
|
da10a049a6
|
Add unicode declarations
|
2017-01-05 13:09:48 +01:00 |
|
Ines Montani
|
8279993a6f
|
Modernize and merge tokenizer tests for punctuation
|
2017-01-04 00:49:20 +01:00 |
|
Ines Montani
|
550630df73
|
Update tokenizer tests for contractions
|
2017-01-04 00:48:42 +01:00 |
|
Ines Montani
|
109f202e8f
|
Update conftest fixture
|
2017-01-04 00:48:21 +01:00 |
|
Ines Montani
|
ee6b49b293
|
Modernize tokenizer tests for emoticons
|
2017-01-04 00:47:59 +01:00 |
|
Ines Montani
|
f09b5a5dfd
|
Modernize tokenizer tests for infixes
|
2017-01-04 00:47:42 +01:00 |
|
Ines Montani
|
59059fed27
|
Move regression test for #351 to own file
|
2017-01-04 00:47:11 +01:00 |
|
Ines Montani
|
667051375d
|
Modernize tokenizer tests for whitespace
|
2017-01-04 00:46:35 +01:00 |
|
Ines Montani
|
aafc894285
|
Modernize tokenizer tests for contractions
Use @pytest.mark.parametrize.
|
2017-01-03 23:02:21 +01:00 |
|
Ines Montani
|
fb9d3bb022
|
Revert "Merge remote-tracking branch 'origin/master'"
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
|
2017-01-03 18:21:36 +01:00 |
|
Matthew Honnibal
|
3ba7c167a8
|
Fix URL tests
|
2016-12-30 17:10:08 -06:00 |
|
Matthew Honnibal
|
3e8d9c772e
|
Test interaction of token_match and punctuation
Check that the new token_match function applies after punctuation is split off.
|
2016-12-31 00:52:17 +11:00 |
|
Gyorgy Orosz
|
1748549aeb
|
Added exception pattern mechanism to the tokenizer.
|
2016-12-21 23:16:19 +01:00 |
|