spaCy/spacy/lang/it/tokenizer_exceptions.py

from ...symbols import ORTH
from ...util import update_exc
from ..tokenizer_exceptions import BASE_EXCEPTIONS

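# Exceptions with an explicit token sequence: elided preposition/article + "art."
# is split in two, "po'" stays whole, and "sett.." becomes "sett." + ".".
_exc = {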
    "all'art.": [{ORTH: "all'"}, {ORTH: "art."}],
    "dall'art.": [{ORTH: "dall'"}, {ORTH: "art."}],
    "dell'art.": [{ORTH: "dell'"}, {ORTH: "art."}],
    "L'art.": [{ORTH: "L'"}, {ORTH: "art."}],
    "l'art.": [{ORTH: "l'"}, {ORTH: "art."}],
    "nell'art.": [{ORTH: "nell'"}, {ORTH: "art."}],
    "po'": [{ORTH: "po'"}],
    "sett..": [{ORTH: "sett."}, {ORTH: "."}],
}

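# Strings that must stay a single token (abbreviations, hyphenated compounds,
# slashed forms, and similar).
for orth in [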
    "..",
    "....",
    "a.C.",
    "al.",
    "all-path",
    "art.",
    "Art.",
    "artt.",
    "att.",
    "avv.",
    "Avv.",
    "by-pass",
    "c.d.",
    "c/c",
    "C.so",
    "centro-sinistra",
    "check-up",
    "Civ.",
    "cm.",
    "Cod.",
    "col.",
    "Cost.",
    "d.C.",
    'de"',
    "distr.",
    "E'",
    "ecc.",
    "e-mail",
    "e/o",
    "etc.",
    "Jr.",
    "n°",
    "nord-est",
    "pag.",
    "Proc.",
    "prof.",
    "sett.",
    "s.p.a.",
    "s.n.c",
    "s.r.l",
    "ss.",
    "St.",
    "tel.",
    "week-end",
]:
    _exc[orth] = [{ORTH: orth}]

# Merge the Italian-specific exceptions into the shared base exceptions.
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)