Commit Graph

75 Commits

Author SHA1 Message Date
adrianeboyd
cbc2cee2c8 Improve URL_PATTERN and handling in tokenizer (#4374)
* Move prefix and suffix detection for URL_PATTERN

Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.

Fix tokenization for Hungarian given new modified handling of prefixes
and suffixes.

* Match a wider range of URI schemes
2019-10-05 13:00:09 +02:00
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Matthew Honnibal
c5f947f194 Fix regex deprecation warnings 2019-02-21 11:56:47 +01:00
Sofie
46dfe773e1 Replacing regex library with re to increase tokenization speed (#3218)
* replace unicode categories with raw list of code points

* simplifying ranges

* fixing variable length quotes

* removing redundant regular expression

* small cleanup of regexp notations

* quotes and alpha as ranges instead of alterations

* removed most regexp dependencies and features

* exponential backtracking - unit tests

* rewrote expression with pathological backtracking

* disabling double hyphen tests for now

* test additional variants of repeating punctuation

* remove regex and redundant backslashes from load_reddit script

* small typo fixes

* disable double punctuation test for russian

* clean up old comments

* format block code

* final cleanup

* naming consistency

* french strings as unicode for python 2 support

* french regular expression case insensitive
2019-02-01 18:05:22 +11:00
Ines Montani
b6e991440c 💫 Tidy up and auto-format tests (#2967)
* Auto-format tests with black

* Add flake8 config

* Tidy up and remove unused imports

* Fix redefinitions of test functions

* Replace orths_and_spaces with words and spaces

* Fix compatibility with pytest 4.0

* xfail test for now

Test was previously overwritten by following test due to naming conflict, so failure wasn't reported

* Unfail passing test

* Only use fixture via arguments

Fixes pytest 4.0 compatibility
2018-11-27 01:09:36 +01:00
Ines Montani
75f3234404
💫 Refactor test suite (#2568)
## Description

Related issues: #2379 (should be fixed by separating model tests)

* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways)
* tidied up and rewrote existing tests wherever possible

### Todo

- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests


### Types of change
enhancement, tests

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:38:44 +02:00
ines
a4d974d97b Port over URL pattern changes from #1411 2017-10-14 12:58:07 +02:00
Matthew Honnibal
8c945310fb Excuse emoji failure on narrow unicode builds 2017-09-16 16:21:13 +02:00
ines
34a2eecb17 Add simple "naughty strings" test (see #1107) 2017-06-06 17:43:51 +02:00
Gyorgy Orosz
8c0b4b850e Fixed emoji handling for Hungarian 2017-05-30 21:34:46 +02:00
ines
a8e58e04ef Add symbols class to punctuation rules to handle emoji (see #1088)
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽‍💻 into account.
2017-05-27 17:57:10 +02:00
ines
0084466a66 Remove unused utf8open util and replace os.path with ensure_path 2017-04-16 20:37:45 +02:00
ines
444dd511c5 Fix xpassing URL test case 2017-04-07 17:36:05 +02:00
ines
10e29189ac Adjust URL testcases and xfail problems (instead of comment) 2017-03-10 14:22:50 +01:00
Dan Rapp
123d3f2d38 Fix error in test case parameterization 2017-03-09 12:18:21 -07:00
Dan Rapp
b9307dfcd7 Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix 2017-03-09 11:42:14 -07:00
Dan Rapp
3b1df3808d Issue #840 - URL pattenr too broad 2017-03-09 11:39:39 -07:00
Aniruddha Adhikary
696215a3fb add tests for Bengali 2017-03-05 11:25:12 +06:00
Ines Montani
138c53ff2e Merge tokenizer tests 2017-01-13 01:34:14 +01:00
Ines Montani
33e5f8dc2e Create basic and extended test set for URLs 2017-01-12 23:40:02 +01:00
Ines Montani
ae7edd30e7 Move text file back to tokenizer tests directory 2017-01-12 02:10:23 +01:00
Ines Montani
c682b8ca90 Merge conftests into one cohesive file 2017-01-11 13:56:32 +01:00
Ines Montani
869963c3c4 Mark extensive prefix/suffix tests as slow 2017-01-10 15:57:35 +01:00
Ines Montani
487e020ebe Add simple test for surrounding brackets 2017-01-10 15:57:26 +01:00
Ines Montani
0ba5cf51d2 Assert length first 2017-01-10 15:57:00 +01:00
Ines Montani
2185d31907 Adjust names and formatting 2017-01-10 15:56:35 +01:00
Ines Montani
e10d4ca964 Remove semi-redundant URLs and punctuation for faster testing 2017-01-10 15:54:25 +01:00
Ines Montani
3a3cb2c90c Add unicode declaration 2017-01-10 15:53:15 +01:00
Matthew Honnibal
42cd598f57 Use correct fixtures in URL tokenizer 2017-01-09 14:10:40 +01:00
Ines Montani
aa876884f0 Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Ines Montani
abb09782f9 Move sun.txt to original location and fix path to not break parser tests 2017-01-08 20:32:54 +01:00
Ines Montani
bbe7cab3a1 Move non-English-specific tests back to general tokenizer tests 2017-01-05 18:09:29 +01:00
Ines Montani
637f785036 Add general sanity tests for all tokenizers 2017-01-05 16:25:38 +01:00
Ines Montani
c5f2dc15de Move English tokenizer tests to directory /en 2017-01-05 16:25:04 +01:00
Ines Montani
8b45363b4d Modernize and merge general tokenizer tests 2017-01-05 13:17:05 +01:00
Ines Montani
02cfda48c9 Modernize and merge tokenizer tests for string loading 2017-01-05 13:16:55 +01:00
Ines Montani
a11f684822 Modernize and merge tokenizer tests for whitespace 2017-01-05 13:16:33 +01:00
Ines Montani
8b284fc6f1 Modernize and merge tokenizer tests for text from file 2017-01-05 13:15:52 +01:00
Ines Montani
2c2e878653 Modernize and merge tokenizer tests for punctuation 2017-01-05 13:14:16 +01:00
Ines Montani
8a74129cdf Modernize and merge tokenizer tests for prefixes/suffixes/infixes 2017-01-05 13:13:12 +01:00
Ines Montani
0e65dca9a5 Modernize and merge tokenizer tests for exception and emoticons 2017-01-05 13:11:31 +01:00
Ines Montani
34c47bb20d Fix formatting 2017-01-05 13:10:51 +01:00
Ines Montani
2e72683baa Add missing docstrings 2017-01-05 13:10:21 +01:00
Ines Montani
da10a049a6 Add unicode declarations 2017-01-05 13:09:48 +01:00
Ines Montani
8279993a6f Modernize and merge tokenizer tests for punctuation 2017-01-04 00:49:20 +01:00
Ines Montani
550630df73 Update tokenizer tests for contractions 2017-01-04 00:48:42 +01:00
Ines Montani
109f202e8f Update conftest fixture 2017-01-04 00:48:21 +01:00
Ines Montani
ee6b49b293 Modernize tokenizer tests for emoticons 2017-01-04 00:47:59 +01:00