# coding: utf8
from __future__ import unicode_literals

import gc

from ...lang.en import English


def test_issue1506():
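    """Stream many short texts through nlp.pipe, periodically forcing
    garbage collection, and check that each token's lemma string can
    still be looked up afterwards (regression test for issue #1506)."""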
    nlp = English()

    def string_generator():
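        # Each text is repeated 10001 times so the stream crosses the
        # 10000-doc marks at which gc.collect() is forced below.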
        for _ in range(10001):
            yield u"It's sentence produced by that bug."

        for _ in range(10001):
            yield u"I erase some hbdsaj lemmas."

        for _ in range(10001):
            yield u"I erase lemmas."

        for _ in range(10001):
            yield u"It's sentence produced by that bug."

        for _ in range(10001):
            yield u"It's sentence produced by that bug."

    for i, d in enumerate(nlp.pipe(string_generator())):
        # Cleanup has to run more than once to actually free the data:
        # the first pass only marks strings as "not hit".
        if i == 10000 or i == 20000 or i == 30000:
            gc.collect()

        for t in d:
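            # Force a lookup of the lemma string; presumably this is what
            # broke in issue #1506 when strings were cleaned up too eagerly.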
            str(t.lemma_)