[issue37723] important performance regression on regular expression parsing

2019-07-30 Thread yannvgn


New submission from yannvgn:

In complex cases, parsing regular expressions takes much, much longer on Python >= 3.7.

Example (ipython):

In [1]: import re
In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])
In [3]: long_char_list = char_list * 10
In [4]: pattern = f'[{re.escape(long_char_list)}]'
In [5]: %time compiled = re.compile(pattern)

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in 2.6 seconds:
CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
Wall time: 2.64 s

On Python 3.7.3, the regexp compiled in 15 minutes (~350x increase in this case):
CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
Wall time: 15min 9s
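
For anyone reproducing this outside IPython, here is an equivalent standalone script (a sketch; the range value mirrors the example above, and timings will of course vary by machine):

import re
import time

# Build a very large character class, as in the example above.
char_list = ''.join(chr(i) for i in range(0xffff))
long_char_list = char_list * 10
pattern = f'[{re.escape(long_char_list)}]'

start = time.perf_counter()
re.compile(pattern)
print(f'compiled in {time.perf_counter() - start:.1f} s')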

Profiling with cProfile shows that the issue is caused by the sre_parse._uniq function, which does not exist in Python <= 3.6.

The complexity of this function is O(N^2) on average, but it can easily be reduced to O(N).
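
To illustrate the difference (a sketch, not the actual sre_parse code): deduplicating with a list-membership test is quadratic, while an order-preserving dict makes it linear:

def uniq_quadratic(items):
    # O(N^2): every 'in' test scans the growing result list.
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result

def uniq_linear(items):
    # O(N): dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))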

The issue might not be noticeable with simple regexps, but programs like text tokenizers, which use complex regexps, can be heavily impacted by this regression.

--
components: Regular Expressions
messages: 348771
nosy: ezio.melotti, mrabarnett, yannvgn
priority: normal
severity: normal
status: open
title: important performance regression on regular expression parsing
type: performance
versions: Python 3.7, Python 3.8, Python 3.9

___
Python tracker <https://bugs.python.org/issue37723>
___

[issue37723] important performance regression on regular expression parsing

2019-07-30 Thread yannvgn


Change by yannvgn:


--
keywords: +patch
pull_requests: +14788
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/15030

[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread yannvgn


yannvgn added the comment:

> Indeed, it was not expected that the character set would contain hundreds of
> thousands of items. What is its size in your real code?

> Could you please show benchmarking results for different implementations and 
> different sizes?

I can't answer that precisely, but sacremoses (a tokenization package), for example, is strongly impacted. See
https://github.com/alvations/sacremoses/issues/61#issuecomment-516401853

--

[issue37723] important performance regression on regular expression parsing

2019-07-31 Thread yannvgn


yannvgn added the comment:

Hey Matthew,

We decided to go for this, which is simpler and more straightforward:

def _uniq(items):
    return list(dict.fromkeys(items))

(see https://github.com/python/cpython/pull/15030)
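
This works because dicts preserve insertion order (guaranteed since Python 3.7) and lookups are O(1) on average, so duplicates are dropped in linear time while keeping the first occurrence of each item. For example:

>>> list(dict.fromkeys('abcabc'))
['a', 'b', 'c']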

--
