[issue37723] important performance regression on regular expression parsing
New submission from yannvgn:

On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7. Example (ipython):

    In [1]: import re
    In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])
    In [3]: long_char_list = char_list * 10
    In [4]: pattern = f'[{re.escape(long_char_list)}]'
    In [5]: %time compiled = re.compile(pattern)

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in 2.6 seconds:

    CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
    Wall time: 2.64 s

On Python 3.7.3, the regexp compiled in 15 minutes (a ~350x increase in this case):

    CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
    Wall time: 15min 9s

Profiling with cProfile shows that the issue is caused by the sre_parse._uniq function, which does not exist in Python <= 3.6. On average the complexity of this function is O(N^2), but it can easily be reduced to O(N) (a sketch of both approaches follows below).

The issue might not be noticeable with simple regexps, but programs like text tokenizers - which use complex regexps - might really be impacted by this regression.

--
components: Regular Expressions
messages: 348771
nosy: ezio.melotti, mrabarnett, yannvgn
priority: normal
severity: normal
status: open
title: important performance regression on regular expression parsing
type: performance
versions: Python 3.7, Python 3.8, Python 3.9

Python tracker <https://bugs.python.org/issue37723>
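For reference, the quadratic helper and a linear-time alternative look roughly like this (a paraphrased sketch, not the verbatim CPython source; the function names here are illustrative):

    # Paraphrased sketch of the O(N^2) approach: the membership test
    # `item not in newitems` scans a list, so each item costs O(N).
    def _uniq_quadratic(items):
        newitems = []
        for item in items:
            if item not in newitems:
                newitems.append(item)
        return newitems

    # The same deduplication in O(N): a set gives O(1) average lookups,
    # while the list preserves the order of first occurrences.
    def _uniq_linear(items):
        seen = set()
        newitems = []
        for item in items:
            if item not in seen:
                seen.add(item)
                newitems.append(item)
        return newitems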
Change by yannvgn:

--
keywords: +patch
pull_requests: +14788
stage: -> patch review
pull_request: https://github.com/python/cpython/pull/15030
yannvgn added the comment:

> Indeed, it was not expected that the character set would contain hundreds
> of thousands of items. What is its size in your real code?
> Could you please show benchmarking results for different implementations
> and different sizes?

I can't answer that precisely, but sacremoses (a tokenization package), for example, is strongly impacted. See https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853
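A minimal sketch of the kind of benchmark being asked for, comparing the two approaches at a few input sizes (the implementations and sizes here are illustrative assumptions, not measured CPython results):

    import timeit

    def uniq_quadratic(items):
        # list membership test makes this O(N^2)
        newitems = []
        for item in items:
            if item not in newitems:
                newitems.append(item)
        return newitems

    def uniq_linear(items):
        # dict insertion is O(1) amortized, so this is O(N)
        return list(dict.fromkeys(items))

    # Keep sizes modest: the quadratic version grows very fast.
    for n in (1_000, 5_000, 20_000):
        items = [chr(i) for i in range(n)]  # all unique: worst case for the list scan
        for fn in (uniq_quadratic, uniq_linear):
            t = timeit.timeit(lambda: fn(items), number=1)
            print(f'{fn.__name__:>15} n={n:>6}: {t:.4f}s')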
yannvgn added the comment:

Hey Matthew, we decided to go for this, which is simpler and straightforward:

    def _uniq(items):
        return list(dict.fromkeys(items))

(see https://github.com/python/cpython/pull/15030)
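To illustrate why this works as a deduplication: dict.fromkeys inserts each key once in a single O(N) pass, and dicts preserve insertion order (guaranteed since Python 3.7, exactly the versions affected), so first occurrences keep their positions. The only requirement is that the items be hashable, which a set-based approach would need as well:

    >>> list(dict.fromkeys(['b', 'a', 'b', 'c', 'a']))
    ['b', 'a', 'c']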