Commit Graph

6 Commits

Author SHA1 Message Date
Ines Montani
eddeb36c96
💫 Tidy up and auto-format .py files (#2983)
<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.

At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-30 17:03:03 +01:00
Ines Montani
ea20b72c08 💫 Make like_num work for prefixed numbers (#2808)
* Only split + prefix if not numbers

* Make like_num work for prefixed numbers

* Add test for like_num
2018-10-01 10:49:14 +02:00
Jani Monoses
ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Jani Monoses
0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Jani Monoses
42b34832e4 Update Romanian stopword list (#2316)
* Contributor agreement for janimo

* Update Romanian stopword list

Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).

Add another unrelated spelling fix.

See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
ines
c6fea3e5f6 Add Romanian and Croatian skeletons (experimental)
Add language data templates to make it easier for others to contribute to the language support
2017-11-01 23:04:28 +01:00