Add Latin language support (#11349)

* Add lang folder for la (Latin)

* Add Latin lang classes

* Add minimal tokenizer exceptions

* Add minimal stopwords

* Add minimal lex_attrs

* Update stopwords, tokenizer exceptions

* Add la tests; register la_tokenizer in conftest.py

* Update spacy/lang/la/lex_attrs.py

Remove duplicate form in Latin lex_attrs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update natto-py version spec (#11222)

* Update natto-py version spec

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add scorer to textcat API docs config settings (#11263)

* Update docs for pipeline initialize() methods (#11221)

* Update documentation for dependency parser

* Update documentation for trainable_lemmatizer

* Update documentation for entity_linker

* Update documentation for ner

* Update documentation for morphologizer

* Update documentation for senter

* Update documentation for spancat

* Update documentation for tagger

* Update documentation for textcat

* Update documentation for tok2vec

* Run prettier on edited files

* Apply similar changes in transformer docs

* Remove need to say annotated example explicitly

I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.

* Run prettier on transformer docs

* chore: add 'concepCy' to spacy universe (#11255)

* chore: add 'concepCy' to spacy universe

* docs: add 'slogan' to concepCy

* Support full prerelease versions in the compat table (#11228)

* Support full prerelease versions in the compat table

* Fix types

* adding spans to doc_annotation in Example.to_dict (#11261)

* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix regex invalid escape sequences (#11276)

* Add W605 to the errors raised by flake8 in the CI (#11283)

* Clean up automated label-based issue handling (#11284)

* Clean up automated label-based issue handling

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name

* Fix Dutch noun chunks to skip overlapping spans (#11275)

* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)

* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* include span_ruler for default warning filter (#11333)

* Add uk pipelines to website (#11332)

* Check for . in factory names (#11336)

* Make fixes for PR #11349

* Fix roman numeral coverage in #11349

Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>

spacy/lang/la/__init__.py (new file)

@@ -0,0 +1,18 @@
from ...language import Language, BaseDefaults
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS


class LatinDefaults(BaseDefaults):
    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
    stop_words = STOP_WORDS
    lex_attr_getters = LEX_ATTRS


class Latin(Language):
    lang = "la"
    Defaults = LatinDefaults


__all__ = ["Latin"]

spacy/lang/la/lex_attrs.py (new file)

@@ -0,0 +1,32 @@
from ...attrs import LIKE_NUM
import re

# cf. Goyvaerts/Levithan 2009; case-insensitive, allow 4
roman_numerals_compile = re.compile(
    r"(?i)^(?=[MDCLXVI])M*(C[MD]|D?C{0,4})(X[CL]|L?X{0,4})(I[XV]|V?I{0,4})$"
)

_num_words = set(
    """
unus una unum duo duae tres tria quattuor quinque sex septem octo novem decem
""".split()
)

_ordinal_words = set(
    """
primus prima primum secundus secunda secundum tertius tertia tertium
""".split()
)


def like_num(text):
    if text.isdigit():
        return True
    if roman_numerals_compile.match(text):
        return True
    if text.lower() in _num_words:
        return True
    if text.lower() in _ordinal_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
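
The getter can also be exercised on its own; a quick illustrative check, mirroring the parametrized tests added later in this commit:

from spacy.lang.la.lex_attrs import like_num

assert like_num("MMXXII")       # Roman numeral
assert like_num("quattuor")     # cardinal number word
assert like_num("tertius")      # ordinal number word
assert not like_num("canis")    # ordinary noun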

spacy/lang/la/stop_words.py (new file)

@@ -0,0 +1,37 @@
# Corrected Perseus list, cf. https://wiki.digitalclassicist.org/Stopwords_for_Greek_and_Latin
STOP_WORDS = set(
"""
ab ac ad adhuc aliqui aliquis an ante apud at atque aut autem
cum cur
de deinde dum
ego enim ergo es est et etiam etsi ex
fio
haud hic
iam idem igitur ille in infra inter interim ipse is ita
magis modo mox
nam ne nec necque neque nisi non nos
o ob
per possum post pro
quae quam quare qui quia quicumque quidem quilibet quis quisnam quisquam quisque quisquis quo quoniam
sed si sic sive sub sui sum super suus
tam tamen trans tu tum
ubi uel uero
vel vero
""".split()
)
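
Because `LatinDefaults` wires this set into the language class, stop words surface as `token.is_stop`. A small sketch (assuming the blank "la" pipeline shown above):

import spacy

nlp = spacy.blank("la")
doc = nlp("non scholae sed vitae discimus")
print([token.text for token in doc if token.is_stop])  # expected: ['non', 'sed']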

spacy/lang/la/tokenizer_exceptions.py (new file)

@@ -0,0 +1,30 @@
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...symbols import ORTH
from ...util import update_exc


## TODO: Look into systematically handling u/v
_exc = {
    "mecum": [{ORTH: "me"}, {ORTH: "cum"}],
    "tecum": [{ORTH: "te"}, {ORTH: "cum"}],
    "nobiscum": [{ORTH: "nobis"}, {ORTH: "cum"}],
    "vobiscum": [{ORTH: "vobis"}, {ORTH: "cum"}],
    "uobiscum": [{ORTH: "uobis"}, {ORTH: "cum"}],
}

for orth in [
    'A.', 'Agr.', 'Ap.', 'C.', 'Cn.', 'D.', 'F.', 'K.', 'L.', "M'.", 'M.', 'Mam.', 'N.', 'Oct.',
    'Opet.', 'P.', 'Paul.', 'Post.', 'Pro.', 'Q.', 'S.', 'Ser.', 'Sert.', 'Sex.', 'St.', 'Sta.',
    'T.', 'Ti.', 'V.', 'Vol.', 'Vop.', 'U.', 'Uol.', 'Uop.',
    'Ian.', 'Febr.', 'Mart.', 'Apr.', 'Mai.', 'Iun.', 'Iul.', 'Aug.', 'Sept.', 'Oct.', 'Nov.', 'Nou.',
    'Dec.',
    'Non.', 'Id.', 'A.D.',
    'Coll.', 'Cos.', 'Ord.', 'Pl.', 'S.C.', 'Suff.', 'Trib.',
]:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
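
The enclitic splits can be verified end to end; a short sketch along the lines of the exception test below:

import spacy

nlp = spacy.blank("la")
doc = nlp("ut nobiscum quam primum sis")
print([token.text for token in doc])  # expected: ['ut', 'nobis', 'cum', 'quam', 'primum', 'sis']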

spacy/tests/conftest.py

@@ -256,6 +256,11 @@ def ko_tokenizer_tokenizer():
    return nlp.tokenizer


@pytest.fixture(scope="module")
def la_tokenizer():
    return get_lang_class("la")().tokenizer


@pytest.fixture(scope="session")
def lb_tokenizer():
    return get_lang_class("lb")().tokenizer


@@ -0,0 +1,7 @@
import pytest


def test_la_tokenizer_handles_exc_in_text(la_tokenizer):
    text = "scio te omnia facturum, ut nobiscum quam primum sis"
    tokens = la_tokenizer(text)
    assert len(tokens) == 11
    assert tokens[6].text == "nobis"

@@ -0,0 +1,33 @@
import pytest

from spacy.lang.la.lex_attrs import like_num


@pytest.mark.parametrize(
    "text,match",
    [
        ("IIII", True),
        ("VI", True),
        ("vi", True),
        ("IV", True),
        ("iv", True),
        ("IX", True),
        ("ix", True),
        ("MMXXII", True),
        ("0", True),
        ("1", True),
        ("quattuor", True),
        ("decem", True),
        ("tertius", True),
        ("canis", False),
        ("MMXX11", False),
        (",", False),
    ],
)
def test_lex_attrs_like_number(la_tokenizer, text, match):
    tokens = la_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match


@pytest.mark.parametrize("word", ["quinque"])
def test_la_lex_attrs_capitals(word):
    assert like_num(word)
    assert like_num(word.upper())

@@ -451,7 +451,7 @@ factories.
| Registry name     | Description |
| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `architectures`   | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
| `augmenters`      | Registry for functions that create [data augmentation](#augmenters) callbacks for corpora and other training data iterators. |
| `batchers`        | Registry for training and evaluation [data batchers](#batchers). |
| `callbacks`       | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. |
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |