spaCy/spacy/lang/en/lex_attrs.py

# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM

_num_words = [
    "zero",
    "one",
    "two",
    "three",
    "four",
    "five",
    "six",
    "seven",
    "eight",
    "nine",
    "ten",
    "eleven",
    "twelve",
    "thirteen",
    "fourteen",
    "fifteen",
    "sixteen",
    "seventeen",
    "eighteen",
    "nineteen",
    "twenty",
    "thirty",
    "forty",
    "fifty",
    "sixty",
    "seventy",
    "eighty",
    "ninety",
    "hundred",
    "thousand",
    "million",
    "billion",
    "trillion",
    "quadrillion",
    "gajillion",
    "bazillion",
]


_ordinal_words = [
    "first",
    "second",
    "third",
    "fourth",
    "fifth",
    "sixth",
    "seventh",
    "eighth",
    "ninth",
    "tenth",
    "eleventh",
    "twelfth",
    "thirteenth",
    "fourteenth",
    "fifteenth",
    "sixteenth",
    "seventeenth",
    "eighteenth",
    "nineteenth",
    "twentieth",
    "thirtieth",
    "fortieth",
    "fiftieth",
    "sixtieth",
    "seventieth",
    "eightieth",
    "ninetieth",
    "hundredth",
    "thousandth",
    "millionth",
    "billionth",
    "trillionth",
    "quadrillionth",
    "gajillionth",
    "bazillionth",
]


def like_num(text):
    """Return True if `text` looks like a number, e.g. "10", "3/4" or "ten"."""
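    # Drop one leading sign or approximation marker, e.g. "+5", "-5", "~5".
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points, e.g. "1,000.50".
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True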

    text_lower = text.lower()
    if text_lower in _num_words:
        return True

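    # Check ordinal numbers: spelled out ("fourth") or digits plus "th" ("4th").
    # Digit ordinals ending in "st", "nd" or "rd" are not matched here.
    if text_lower in _ordinal_words:
        return True
    if text_lower.endswith("th"):
        if text_lower[:-2].isdigit():
            return True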

    return False


LEX_ATTRS = {LIKE_NUM: like_num}
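

# Illustrative only, not part of spaCy: a hypothetical helper for spot-checking
# a LIKE_NUM-style predicate such as like_num above. The name check_samples and
# its signature are assumptions made for this sketch.
def check_samples(predicate, samples):
    """Map each sample string to the boolean result of predicate(sample)."""
    return {sample: bool(predicate(sample)) for sample in samples}


# Example (assuming like_num as defined in this module):
#     check_samples(like_num, ["10", "3/4", "Twenty", "4th", "1st"])
# would map "10", "3/4", "Twenty" and "4th" to True and "1st" to False.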