Load exceptions last in Tokenizer.from_bytes

In `Tokenizer.from_bytes`, the exceptions should be loaded last so that they are only processed once as part of loading the model. The exceptions are tokenized as phrase matcher patterns in the background and the internal tokenization needs to be synced with all the remaining tokenizer settings. If the exceptions are not loaded last, there are speed regressions for `Tokenizer.from_bytes/disk` vs. `Tokenizer.add_special_case` as the caches are reloaded more than necessary during deserialization.
2026-01-14 04:19:07 +03:00 · 2023-04-20 09:02:34 +02:00 · 2023-04-20 09:02:34 +02:00 · e315a2637e
commit e315a2637e
parent 8e6a3d58d8
1 changed files with 4 additions and 2 deletions
--- a/spacy/tokenizer.pyx
+++ b/spacy/tokenizer.pyx
@ -834,10 +834,12 @@ cdef class Tokenizer:
            self.token_match = re.compile(data["token_match"]).match
        if "url_match" in data and isinstance(data["url_match"], str):
            self.url_match = re.compile(data["url_match"]).match
-        if "rules" in data and isinstance(data["rules"], dict):
-            self.rules = data["rules"]
        if "faster_heuristics" in data:
            self.faster_heuristics = data["faster_heuristics"]
+        # always load rules last so that all other settings are set before the
+        # internal tokenization for the phrase matcher
+        if "rules" in data and isinstance(data["rules"], dict):
+            self.rules = data["rules"]
        return self