spaCy/spacy/tests/lang/mul/test_tokenizer.py
Edward 360ccf628a
Rename language codes (Icelandic, multi-language) (#12149)
* Init

* fix tests

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix test_blank_languages

* Rename xx to mul in docs

* Format _util with black

* prettier formatting

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-31 17:30:43 +01:00

26 lines
674 B
Python
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

import pytest
MUL_BASIC_TOKENIZATION_TESTS = [
(
"Lääʹddjânnmest lie nuʹtt 10 000 säʹmmliʹžžed. Seeʹst pâʹjjel",
[
"Lääʹddjânnmest",
"lie",
"nuʹtt",
"10",
"000",
"ʹmmliʹžžed",
".",
"Seeʹst",
"ʹjjel",
],
),
]
@pytest.mark.parametrize("text,expected_tokens", MUL_BASIC_TOKENIZATION_TESTS)
def test_mul_tokenizer_basic(mul_tokenizer, text, expected_tokens):
tokens = mul_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list