spaCy/spacy/lang/pt/lex_attrs.py
Ines Montani eddeb36c96
💫 Tidy up and auto-format .py files (#2983)

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` and actually get meaningful results.

At the moment, the code style and linting aren't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-30 17:03:03 +01:00

123 lines
2.0 KiB
Python

# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


_num_words = [
    "zero",
    "um",
    "dois",
    "três",
    "tres",
    "quatro",
    "cinco",
    "seis",
    "sete",
    "oito",
    "nove",
    "dez",
    "onze",
    "doze",
    "dúzia",
    "dúzias",
    "duzia",
    "duzias",
    "treze",
    "catorze",
    "quinze",
    "dezasseis",
    "dezassete",
    "dezoito",
    "dezanove",
    "vinte",
    "trinta",
    "quarenta",
    "cinquenta",
    "sessenta",
    "setenta",
    "oitenta",
    "noventa",
    "cem",
    "cento",
    "duzentos",
    "trezentos",
    "quatrocentos",
    "quinhentos",
    "seiscentos",
    "setecentos",
    "oitocentos",
    "novecentos",
    "mil",
    "milhão",
    "milhao",
    "milhões",
    "milhoes",
    "bilhão",
    "bilhao",
    "bilhões",
    "bilhoes",
    "trilhão",
    "trilhao",
    "trilhões",
    "trilhoes",
    "quadrilhão",
    "quadrilhao",
    "quadrilhões",
    "quadrilhoes",
]


_ordinal_words = [
    "primeiro",
    "segundo",
    "terceiro",
    "quarto",
    "quinto",
    "sexto",
    "sétimo",
    "oitavo",
    "nono",
    "décimo",
    "vigésimo",
    "trigésimo",
    "quadragésimo",
    "quinquagésimo",
    "sexagésimo",
    "septuagésimo",
    "octogésimo",
    "nonagésimo",
    "centésimo",
    "ducentésimo",
    "trecentésimo",
    "quadringentésimo",
    "quingentésimo",
    "sexcentésimo",
    "septingentésimo",
    "octingentésimo",
    "nongentésimo",
    "milésimo",
    "milionésimo",
    "bilionésimo",
]
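

def like_num(text):
    """Return True if the token text looks like a number in Portuguese."""
    # Strip a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop separators and ordinal indicators, e.g. "1.000,5" or "2º".
    text = text.replace(",", "").replace(".", "").replace("º", "").replace("ª", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out cardinals and ordinals, matched case-insensitively.
    if text.lower() in _num_words:
        return True
    if text.lower() in _ordinal_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}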
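
To see which inputs `like_num` accepts, here is a minimal standalone sketch. It copies the function's logic so it runs without spaCy installed, and it assumes heavily abbreviated word lists (`_num_words` and `_ordinal_words` below are illustrative stand-ins, not the full tables from this module). The `examples` list and `results` dict are hypothetical names used only for this demonstration.

```python
# Abbreviated stand-ins for the module's word lists (illustration only).
_num_words = ["zero", "um", "dois", "três", "dez", "mil"]
_ordinal_words = ["primeiro", "segundo", "décimo"]


def like_num(text):
    # Same checks as the module's function, reproduced so the sketch is self-contained.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "").replace("º", "").replace("ª", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    if text.lower() in _ordinal_words:
        return True
    return False


examples = ["10", "1.000,50", "-5", "3/4", "2º", "Três", "primeiro", "casa"]
results = {text: like_num(text) for text in examples}
for text, is_num in results.items():
    print(text, is_num)
```

Under these assumptions every example except `"casa"` is treated as number-like: digits with separators, signed values, simple fractions, ordinal indicators, and listed words in any letter case.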