Commit Graph

7 Commits

Author SHA1 Message Date
Ines Montani
f580302673 Tidy up and auto-format 2019-08-20 17:36:34 +02:00
Paul O'Leary McCann
756b66b7c0 Reduce size of language data (#4141)
* Move Turkish lemmas to a json file

Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.

This focuses on Turkish specifically because it has the most language
data in core.

* Transition all lemmatizer.py files to json

This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.

None of the affected files have logic, so this was pretty
straightforward.

One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.

* Move large lang data to json for fr/nb/nl/sv

These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.

For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.

* Fix id lemmas.json

The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.

* Add .json.gz to gitignore

This covers the json.gz files built as part of distribution.

* Add language data gzip to build process

Currently this gzip data on every build; it works, but it should be
changed to only gzip when the source file has been updated.

* Remove Danish lemmatizer.py

Missed this when I added the json.

* Update to match latest explosion/srsly#9

The way gzipped json is loaded/saved in srsly changed a bit.

* Only compress language data if necessary

If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.

* Move en/el language data to json

This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.

* Remove empty files in Norwegian tokenizer

It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.

This removed the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.

* Remove dubious entries in English lookup.json

" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.

* Fix small issues with en/fr lemmatizers

The en tokenizer was including the removed _nouns.py file, so that's
removed.

The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.

* Auto-format

* Auto-format

* Update srsly pin

* Consistently use pathlib paths
2019-08-20 14:54:11 +02:00
Knut O. Hellan
a54f0cfc2b Norwegian tweaks (#3894)
* Norwegian fix

Add support for alternative past tense verb form (vaska).

* Norwegian months

Add all Norwegian months to tokenizer excpetions.

* More Norwegian abbreviations

Add more Norwegian abbreviations to tokenizer_exceptions.

* Contributor agreement khellan

Add signed contributor agreement for khellan (Knut O. Hellan).
2019-07-08 10:28:47 +02:00
Ines Montani
61d09c481b Merge branch 'master' into develop 2018-12-18 13:48:10 +01:00
Beate Sildnes
f0d7e206ec Updated wordforms for Norwegian lemmatizer (#3007)
* Updated wordforms for Norwegian lemmatizer

Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup).

* Add spaCy contributor agreement for user beatesi

*  Updated wordforms for Norwegian lemmatizer
2018-12-06 15:46:18 +01:00
Ines Montani
eddeb36c96
💫 Tidy up and auto-format .py files (#2983)
<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.

At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-30 17:03:03 +01:00
katarkor
5ca853bee0 changed tag_map, morph_rules, lemmatizer for Norwegian (#2565)
* changed tag_map, morph_rules, lemmatizer for Norwegian

* Move unicode declaration up

Hopefully fixes test failure on Python 2

* Update CONTRIBUTOR_AGREEMENT.md

* Move unicode declarations

Hopefully fixes test this time

* Revert "Merge remote-tracking branch 'origin/patch-1'"

This reverts commit f5ccd5dd0d, reversing
changes made to dd07e180ea.

* Update contributor agreement [ci skip]
2018-07-19 19:38:24 +02:00