Commit Graph

67 Commits

Author SHA1 Message Date
Raphaël Bournhonesque
3452d6ce52 Resolve issue #1078 by simplifying URL pattern
- avoid catastrophic backtracking
- reduce character range of host name, domain name and TLD identifier
2017-10-11 11:24:00 +02:00
Matthew Honnibal
45029a550e Fix customized-tokenizer tests 2017-09-04 20:13:13 +02:00
Matthew Honnibal
c68f188eb0 Fix error on test 2017-09-04 18:59:36 +02:00
Matthew Honnibal
9bffcaa73d Update test to make it slightly more direct
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
2017-09-01 21:16:56 +02:00
Vimos Tan
a6d9fb5bb6 fix issue #1292 2017-08-30 14:49:14 +08:00
ines
0084466a66 Remove unused utf8open util and replace os.path with ensure_path 2017-04-16 20:37:45 +02:00
ines
444dd511c5 Fix xpassing URL test case 2017-04-07 17:36:05 +02:00
ines
10e29189ac Adjust URL testcases and xfail problems (instead of comment) 2017-03-10 14:22:50 +01:00
Dan Rapp
123d3f2d38 Fix error in test case parameterization 2017-03-09 12:18:21 -07:00
Dan Rapp
b9307dfcd7 Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix 2017-03-09 11:42:14 -07:00
Dan Rapp
3b1df3808d Issue #840 - URL pattenr too broad 2017-03-09 11:39:39 -07:00
Aniruddha Adhikary
696215a3fb add tests for Bengali 2017-03-05 11:25:12 +06:00
Ines Montani
138c53ff2e Merge tokenizer tests 2017-01-13 01:34:14 +01:00
Ines Montani
33e5f8dc2e Create basic and extended test set for URLs 2017-01-12 23:40:02 +01:00
Ines Montani
ae7edd30e7 Move text file back to tokenizer tests directory 2017-01-12 02:10:23 +01:00
Ines Montani
c682b8ca90 Merge conftests into one cohesive file 2017-01-11 13:56:32 +01:00
Ines Montani
869963c3c4 Mark extensive prefix/suffix tests as slow 2017-01-10 15:57:35 +01:00
Ines Montani
487e020ebe Add simple test for surrounding brackets 2017-01-10 15:57:26 +01:00
Ines Montani
0ba5cf51d2 Assert length first 2017-01-10 15:57:00 +01:00
Ines Montani
2185d31907 Adjust names and formatting 2017-01-10 15:56:35 +01:00
Ines Montani
e10d4ca964 Remove semi-redundant URLs and punctuation for faster testing 2017-01-10 15:54:25 +01:00
Ines Montani
3a3cb2c90c Add unicode declaration 2017-01-10 15:53:15 +01:00
Matthew Honnibal
42cd598f57 Use correct fixtures in URL tokenizer 2017-01-09 14:10:40 +01:00
Ines Montani
aa876884f0 Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Ines Montani
abb09782f9 Move sun.txt to original location and fix path to not break parser tests 2017-01-08 20:32:54 +01:00
Ines Montani
bbe7cab3a1 Move non-English-specific tests back to general tokenizer tests 2017-01-05 18:09:29 +01:00
Ines Montani
637f785036 Add general sanity tests for all tokenizers 2017-01-05 16:25:38 +01:00
Ines Montani
c5f2dc15de Move English tokenizer tests to directory /en 2017-01-05 16:25:04 +01:00
Ines Montani
8b45363b4d Modernize and merge general tokenizer tests 2017-01-05 13:17:05 +01:00
Ines Montani
02cfda48c9 Modernize and merge tokenizer tests for string loading 2017-01-05 13:16:55 +01:00
Ines Montani
a11f684822 Modernize and merge tokenizer tests for whitespace 2017-01-05 13:16:33 +01:00
Ines Montani
8b284fc6f1 Modernize and merge tokenizer tests for text from file 2017-01-05 13:15:52 +01:00
Ines Montani
2c2e878653 Modernize and merge tokenizer tests for punctuation 2017-01-05 13:14:16 +01:00
Ines Montani
8a74129cdf Modernize and merge tokenizer tests for prefixes/suffixes/infixes 2017-01-05 13:13:12 +01:00
Ines Montani
0e65dca9a5 Modernize and merge tokenizer tests for exception and emoticons 2017-01-05 13:11:31 +01:00
Ines Montani
34c47bb20d Fix formatting 2017-01-05 13:10:51 +01:00
Ines Montani
2e72683baa Add missing docstrings 2017-01-05 13:10:21 +01:00
Ines Montani
da10a049a6 Add unicode declarations 2017-01-05 13:09:48 +01:00
Ines Montani
8279993a6f Modernize and merge tokenizer tests for punctuation 2017-01-04 00:49:20 +01:00
Ines Montani
550630df73 Update tokenizer tests for contractions 2017-01-04 00:48:42 +01:00
Ines Montani
109f202e8f Update conftest fixture 2017-01-04 00:48:21 +01:00
Ines Montani
ee6b49b293 Modernize tokenizer tests for emoticons 2017-01-04 00:47:59 +01:00
Ines Montani
f09b5a5dfd Modernize tokenizer tests for infixes 2017-01-04 00:47:42 +01:00
Ines Montani
59059fed27 Move regression test for #351 to own file 2017-01-04 00:47:11 +01:00
Ines Montani
667051375d Modernize tokenizer tests for whitespace 2017-01-04 00:46:35 +01:00
Ines Montani
aafc894285 Modernize tokenizer tests for contractions
Use @pytest.mark.parametrize.
2017-01-03 23:02:21 +01:00
Ines Montani
fb9d3bb022 Revert "Merge remote-tracking branch 'origin/master'"
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
2017-01-03 18:21:36 +01:00
Matthew Honnibal
3ba7c167a8 Fix URL tests 2016-12-30 17:10:08 -06:00
Matthew Honnibal
3e8d9c772e Test interaction of token_match and punctuation
Check that the new token_match function applies after punctuation is split off.
2016-12-31 00:52:17 +11:00
Gyorgy Orosz
1748549aeb Added exception pattern mechanism to the tokenizer. 2016-12-21 23:16:19 +01:00