Switch to regex module for URL identification

The URL detection regex was failing on input such as 0.1.2.3, as this input triggered excessive back-tracking in the builtin re module. The solution was to switch to the regex module, which behaves better. Closes #913.
2025-07-16 03:02:41 +03:00 · 2017-04-07 15:47:36 +02:00 · 2017-04-07 15:47:36 +02:00 · e7b1ee9efd
commit e7b1ee9efd
parent 5887383fc0
1 changed files with 2 additions and 2 deletions
--- a/spacy/language_data/tokenizer_exceptions.py
+++ b/spacy/language_data/tokenizer_exceptions.py
@ -1,6 +1,6 @@
 from __future__ import unicode_literals

-import re
+import regex

 # URL validation regex courtesy of: https://mathiasbynens.be/demo/url-regex
 # A few minor mods to this regex to account for use cases represented in test_urls
@ -45,6 +45,6 @@ _URL_PATTERN = (
    r"$"
 ).strip()

-TOKEN_MATCH = re.compile(_URL_PATTERN, re.UNICODE).match
+TOKEN_MATCH = regex.compile(_URL_PATTERN, regex.UNICODE).match

 __all__ = ['TOKEN_MATCH']