spaCy/spacy/tests/regression/test_issue2926.py

# coding: utf8
from __future__ import unicode_literals
from spacy.lang.fr import French


def test_issue2926():
    """ Test that the tokenizer correctly splits tokens separated by a slash (/) ending in a digit """
    nlp = French()
    text = "Learn html5/css3/javascript/jquery"
    doc = nlp(text)

    assert len(doc) == 8

    assert doc[0].text == "Learn"
    assert doc[1].text == "html5"
    assert doc[2].text == "/"
    assert doc[3].text == "css3"
    assert doc[4].text == "/"
    assert doc[5].text == "javascript"
    assert doc[6].text == "/"
    assert doc[7].text == "jquery"
Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293) * splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * remove duplicate * remove xfail for Issue #2179 fixed by Matt * adjust documentation and remove reference to regex lib 2019-02-21 00:10:13 +03:00			`# coding: utf8`
			`from __future__ import unicode_literals`
			`from spacy.lang.fr import French`


			`def test_issue2926():`
			`""" Test that the tokenizer correctly splits tokens separated by a slash (/) ending in a digit """`
			`nlp = French()`
			`text = "Learn html5/css3/javascript/jquery"`
			`doc = nlp(text)`

			`assert len(doc) == 8`

			`assert doc[0].text == "Learn"`
			`assert doc[1].text == "html5"`
			`assert doc[2].text == "/"`
			`assert doc[3].text == "css3"`
			`assert doc[4].text == "/"`
			`assert doc[5].text == "javascript"`
			`assert doc[6].text == "/"`
			`assert doc[7].text == "jquery"`