Serhiy Storchaka added the comment:
It is possible to change this behavior (see example patch). With this patch:
>>> re.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> re.split(r'\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
Unfortunately, this is a backward-incompatible change and will likely break
existing code (it also breaks tests). Consider the following example:
re.split('(:*)', 'ab'). Currently the result is ['ab'], but with the patch it is
['', '', 'a', '', 'b', '', ''].
The third-party regex module [1] has a V1 flag that switches on this
incompatible behavior change.
>>> regex.split('(:*)', 'ab')
['ab']
>>> regex.split('(?V1)(:*)', 'ab')
['', '', 'a', '', 'b', '', '']
>>> regex.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCAGCTGAAACCCCAGCTGACGTACGT']
>>> regex.split(r'(?V1)(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> regex.split(r'\b', "the quick, brown fox")
['the quick, brown fox']
>>> regex.split(r'(?V1)\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
I don't know how to solve this issue without introducing such a flag (or adding
a special boolean argument to re.split()).
As a workaround, I suggest using the regex module.
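If installing a third-party module is not an option, splitting at zero-width
matches can also be approximated with the standard library alone. The following
is only a minimal sketch, not part of the attached patch: split_zero_width is a
hypothetical helper name, it assumes the pattern matches only empty strings
(lookarounds, \b and the like), and it does not insert capture groups into the
result the way re.split() does.

```python
import re

def split_zero_width(pattern, string):
    """Split `string` at every match of `pattern`, including empty matches.

    Hypothetical helper for illustration only. It is intended for patterns
    that match only empty strings and it ignores capture groups.
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, string):
        # Keep the text between the previous match and this one.
        pieces.append(string[last:m.start()])
        last = m.end()
    # Keep whatever follows the final match.
    pieces.append(string[last:])
    return pieces

dna = split_zero_width(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
words = split_zero_width(r'\b', "the quick, brown fox")
```

For the two zero-width patterns above, this should give the same lists as the
patched re.split(). For patterns that can match both empty and non-empty
strings, the result may not agree with either re.split() behavior.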
[1] https://pypi.python.org/pypi/regex
----------
keywords: +patch
Added file: http://bugs.python.org/file37147/re_split_zero_width.patch
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue22817>
_______________________________________