[issue38582] Regular match overflow

2019-10-24 Thread veaba


New submission from veaba <908662...@qq.com>:

Regular match overflow
code:
```python
import re


# list => str
def list_to_str(str_list, code=""):
    if isinstance(str_list, list):
        return code.join(str_list)
    else:
        return ''


def fn_parse_code(list_str, text):
    code_text_str = list_to_str(list_str, '|')  # => xxx|oo
    # Escape regex metacharacters by hand
    reg_list = ['+', '.', '[', ']', '((', '))']
    for reg in reg_list:
        code_text_str_temp = code_text_str.replace(reg, "\\" + reg).replace('((', '(\\(').replace('))', '))')
        code_text_str = code_text_str_temp

    # compile
    pattern_str = re.compile(code_text_str)

    # list_str => \\1\\2\\...
    flag_str = ''
    for item in enumerate(list_str):
        flag_str = flag_str + "\\" + str(item[0] + 1)

    print("1:", pattern_str)
    print("2:", flag_str)
    print("3:", text)
    reg_text = re.sub(r'' + code_text_str + '', "`" + flag_str + "`", text)
    return reg_text

# a list
strs=['(tf.batch_gather)', '(None)', '(tf.bitwise.bitwise_and)', '(None)', 
'(tf.bitwise.bitwise_or)', '(None)', '(tf.bitwise.bitwise_xor)', '(None)', 
'(tf.bitwise.invert)', '(None)', '(tf.bitwise.left_shift)', '(None)', 
'(tf.bitwise.right_shift)', '(None)', '(tf.clip_by_value)', '(None)', 
'(tf.concat)', "('concat')", '(tf.debugging.check_numerics)', '(None)', 
'(tf.dtypes.cast)', '(None)', '(tf.dtypes.complex)', '(None)', 
'(tf.dtypes.saturate_cast)', '(None)', '(tf.dynamic_partition)', '(None)', 
'(tf.expand_dims)', '(None)', '(None)', '(None)', '(tf.gather_nd)', '(None)', 
'(0)', '(tf.gather)', '(None)', '(None)', '(None)', '(0)', '(tf.identity)', 
'(None)', '(tf.io.decode_base64)', '(None)', '(tf.io.decode_compressed)', 
"('')", '(None)', '(tf.io.encode_base64)', '(False)', '(None)', 
'(tf.math.abs)', '(None)', '(tf.math.acos)', '(None)', '(tf.math.acosh)', 
'(None)', '(tf.math.add_n)', '(None)', '(tf.math.add)', '(None)', 
'(tf.math.angle)', '(None)', '(tf.math.asin)', '(None)', '(tf.math.asinh)', 
'(None)', '(tf.math.atan2)', '(None)', '(tf.math.atan)', '(None)', 
'(tf.math.atanh)', '(None)', '(tf.math.ceil)', '(None)', '(tf.math.conj)', 
'(None)', '(tf.math.cos)', '(None)', '(tf.math.cosh)', '(None)', 
'(tf.math.digamma)', '(None)', '(tf.math.divide_no_nan)', '(None)', 
'(tf.math.divide)', '(None)', '(tf.math.equal)', '(None)', '(tf.math.erf)', 
'(None)', '(tf.math.erfc)', '(None)', '(tf.math.exp)', '(None)', 
'(tf.math.expm1)', '(None)', '(tf.math.floor)', '(None)', '(tf.math.floordiv)', 
'(None)', '(tf.math.floormod)', '(None)', '(tf.math.greater_equal)', '(None)', 
'(tf.math.greater)', '(None)', '(tf.math.imag)', '(None)', 
'(tf.math.is_finite)', '(None)', '(tf.math.is_inf)', '(None)', 
'(tf.math.is_nan)', '(None)', '(tf.math.less_equal)', '(None)', 
'(tf.math.less)', '(None)', '(tf.math.lgamma)', '(None)', '(tf.math.log1p)', 
'(None)', '(tf.math.log_sigmoid)', '(None)', '(tf.math.log)', '(None)', 
'(tf.math.logical_and)', '(None)', '(tf.math.logical_not)', '(None)', 
'(tf.math.logical_or)', '(None)', '(tf.math.logical_xor)', "('LogicalXor')", 
'(tf.math.maximum)', '(None)', '(tf.math.minimum)', '(None)', 
'(tf.math.multiply)', '(None)', '(tf.math.negative)', '(None)', 
'(tf.math.not_equal)', '(None)', '(tf.math.pow)', '(None)', '(tf.m
```

[issue38582] re: backreference number in replace string can't >= 100

2019-10-24 Thread veaba


veaba <908662...@qq.com> added the comment:

This comes from a real project of mine
(https://github.com/veaba/tensorflow-docs/blob/master/scripts/spider_tensorflow_docs.py#L39-L56).
My approach may well not be the right one; it was just an attempt made while I
was first learning Python.

The project works as follows: it extracts text according to HTML tags and
converts it to Markdown. The contents of a <code> tag need to be wrapped in the
"`" symbol, so inside a loop the matched strings are substituted back in via
\\* backreferences.

So, as you can see, that is how I ran into this regex overflow error.

It would be helpful for me if the limit on the number of backslash
backreferences could be lifted. Of course, if that really isn't needed, I'll
find another way myself.
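
A minimal sketch of the backtick-wrapping step described above, assuming the names to wrap are plain strings. It builds a single alternation with `re.escape` and uses `\g<0>` (the whole match) in the replacement, so no numbered group is needed however many names there are. The helper name `wrap_in_backticks` and the sample text are hypothetical and are not taken from the linked script.

```python
import re


def wrap_in_backticks(names, text):
    """Wrap every occurrence of any of `names` in backticks."""
    if not names:
        return text
    # re.escape handles '.', '(', '+' and similar characters.
    # Longest names go first so 'tf.math.add_n' wins over 'tf.math.add'.
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # \g<0> is the entire match, so the replacement never needs \1, \2, ...
    return re.sub(alternation, r"`\g<0>`", text)


sample = "tf.math.add(x, y, name=None)"
wrapped = wrap_in_backticks(["tf.math.add", "None"], sample)
print(wrapped)  # `tf.math.add`(x, y, name=`None`)
```

Because the replacement refers only to the whole match, the length of the name list does not affect the replacement string at all.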

--
hgrepos: +385

___
Python tracker 
<https://bugs.python.org/issue38582>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread veaba


veaba <908662...@qq.com> added the comment:

Yes, this is not a good place to use regular expressions.

Using regular expressions:

def actual_re_demo():
    import re
    # The input is an arbitrary string of unknown length...
    text = "tf.where(condition, x=None, y=None, name=None) tf.batch_gather ..."

    # The fields to match are converted into a regular expression, which is
    # also of unknown length.
    pattern_str = re.compile('(tf\\.batch_gather)|(None)|(a1)')

    # I don't know how many groups there will be, so the backreferences can
    # go past \\100, up to \\n.
    x = re.sub(pattern_str, '`'+'\\1\\2'+'`', text)

    print(x)

    # Hoped-for result: a matched name, for example one with the tf. prefix,
    # comes out as `tf.xx`.

    # In practice it is not only tf. as a prefix: the match can be arbitrary
    # characters, a suffix, or anything else.

    # With more than 100 groups, the result is:
    # 989¡¢£¤¥¦§89¨©ª«¬®¯89°±²³´µ¶·89¸¹º»¼½¾¿890123`, 
    # name=`None@ABCDEFG89HIJKLMNO89PQRSTUVW89XYZ[\]^_89`abcdefg89hijklmno89pqrstuvw89xyz{|}~8901234567890123456789

    # From the comments on this issue, I see this is caused by a radix
    # mix-up, which is rather embarrassing.


Using str.replace solves it, and it looks much better:

def no_need_re():
    text = "tf.where(condition, x=None, y=None, name=None) tf.batch_gather ..."
    pattern_list = ['tf.batch_gather', 'None']
    for item in pattern_list:
        text = text.replace(item, '`' + item + '`')

    print(text)

no_need_re()

I would expect an error to be raised directly when the limit is exceeded,
instead of the output silently overflowing into characters like this:

989¡¢£¤¥¦§89¨©ª«¬®¯89°±²³´µ¶·89¸¹º»¼½¾¿890123`, 
name=`None@ABCDEFG89HIJKLMNO89PQRSTUVW89XYZ[\]^_89`abcdefg89hijklmno89pqrstuvw89xyz{|}~8901234567890123456789
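
The characters above are consistent with how `re` parses a replacement template: a backslash followed by three octal digits, such as `\100`, is read as an octal character escape rather than as a reference to group 100, whereas `\g<100>` is unambiguous. A small sketch of the difference, using a made-up pattern with 100 groups (the pattern and variable names are illustrative only):

```python
import re

# Hypothetical pattern with exactly 100 capturing groups;
# only group 100 can match the input "z".
pattern = re.compile("|".join(["(a)"] * 99 + ["(z)"]))

# "\100" is parsed as the octal escape for chr(0o100), which is "@".
octal_result = pattern.sub(r"[\100]", "z")

# "\g<100>" always means "group number 100".
group_result = pattern.sub(r"[\g<100>]", "z")

# Building the replacement from \g<n> avoids the ambiguity for any group count.
safe_template = "".join(rf"\g<{i}>" for i in range(1, pattern.groups + 1))
all_groups = pattern.sub(safe_template, "z")

print(octal_result)  # [@]
print(group_result)  # [z]
print(all_groups)    # z
```

The `@` in the first result matches the `None@ABCDEFG` run in the output quoted above, where `\100` through `\107` were each turned into a single character.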


[issue38582] re: backreference number in replace string can't >= 100

2019-10-25 Thread veaba



veaba <908662...@qq.com> added the comment:

Aha, it's me, the mysterious power from the East. I have only just learned
Python.

I've solved my problem with a very simple str.replace, in three lines.

I ran into this problem by accident while converting HTML text into Markdown
files. The documents contain very complex strings, which is why I took that
approach. It now seems that the method I used before was a very inappropriate
way to implement this, and that was my mistake.

However, I still maintain that this regex overflow is a problem: it silently
produces a bunch of meaningless characters without raising any error.

I didn't find this bug until I had randomly sampled and checked 2K documents.
I don't know whether that is bad luck or good luck.

I will not take part in the discussion of the remaining, more advanced issues.

Good luck.

