Commit Graph

16 Commits

Author SHA1 Message Date
Ines Montani
b9c71fc0f0 Fix flags 2019-03-09 02:46:04 +01:00
Ines Montani
ae09b6a6cf Try fixing unicode inconsistencies on Python 2 2019-03-09 02:37:50 +01:00
Ines Montani
d957d7a697 Auto-format 2019-03-09 02:37:41 +01:00
Sofie
9a478b6db8 Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)
* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* remove duplicate

* remove xfail for Issue #2179 fixed by Matt

* adjust documentation and remove reference to regex lib
2019-02-20 22:10:13 +01:00
Sofie
a3efa3e8d9 Improve Catalan tokenization accuracy (#3225)
* small hyphen clean up for French

* catalan infix similar to french
2019-02-04 20:37:19 +11:00
Sofie
46dfe773e1 Replacing regex library with re to increase tokenization speed (#3218)
* replace unicode categories with raw list of code points

* simplifying ranges

* fixing variable length quotes

* removing redundant regular expression

* small cleanup of regexp notations

* quotes and alpha as ranges instead of alterations

* removed most regexp dependencies and features

* exponential backtracking - unit tests

* rewrote expression with pathological backtracking

* disabling double hyphen tests for now

* test additional variants of repeating punctuation

* remove regex and redundant backslashes from load_reddit script

* small typo fixes

* disable double punctuation test for russian

* clean up old comments

* format block code

* final cleanup

* naming consistency

* french strings as unicode for python 2 support

* french regular expression case insensitive
2019-02-01 18:05:22 +11:00
Ines Montani
61d09c481b Merge branch 'master' into develop 2018-12-18 13:48:10 +01:00
Sofie
c6ad557cea French regular expressions instead of extensive exceptions list (on develop) (#3046) (resolves #2679)
* merge changes of PR 3023 into develop branch instead of master

* further deletions from exception list according to PR 3023
2018-12-16 18:04:55 +01:00
Ines Montani
eddeb36c96
💫 Tidy up and auto-format .py files (#2983)
<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.

At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-30 17:03:03 +01:00
Sofie
585de273cd Fix small typo bug in French regexp + relevant unit test (#2980)
* additional unit test for new entr word not in other lists

* bugfix - unit test works

* use _latin_lower instead of alpha_lower for french

* revert back to ALPHA_LOWER (following the code for languages)

* contributor agreement
2018-11-29 20:16:13 +01:00
ines
acb9bdb852 Fix PRON_LEMMA imports 2017-11-06 17:41:53 +01:00
ines
819e30a26e Tidy up tokenizer exceptions 2017-11-01 23:02:45 +01:00
ines
7e424a1804 Don't copy exception dicts if not necessary and tidy up 2017-10-31 21:05:29 +01:00
ines
a91278cb32 Rename _URL_PATTERN to URL_PATTERN 2017-05-09 00:00:00 +02:00
ines
73b577cb01 Fix relative imports 2017-05-08 22:29:04 +02:00
ines
f46ffe3e89 Move language data to /lang module 2017-05-08 20:00:40 +02:00