[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)
New submission from Dave Mankoff:

Title pretty much says it all. Simple test case:

>>> len(u' \t\r\n\u200B'.strip())
1

Should be zero. Same problem in Python3:

>>> len(' \t\r\n\u200B'.strip())
1

--
components: Unicode
messages: 147538
nosy: ezio.melotti, mankyd
priority: normal
severity: normal
status: open
title: string.strip Does Not Remove Zero-Width-Space (ZWSP)
type: behavior
versions: Python 2.7, Python 3.2

___
Python tracker <http://bugs.python.org/issue13391>
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)
Dave Mankoff added the comment:

I appreciated the quick turnaround on this. Perhaps I am misunderstanding the resolution. I understand that strip uses _PyUnicode_IsWhitespace, and that _PyUnicode_IsWhitespace "Returns 1 for Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise."

However, perhaps this is where the functionality is missing? Upon further inspection, it looks like there may be other missing white-space characters, such as U+FEFF, "Zero Width No-Break Space". Whatever Unicode categories they're in, they're still a form of white-space and should still be removed, no?

This was not the behavior I expected from strip(). This affects string.isspace() as well. I now have to put var.strip().strip(u'\u200B\ufeff') anywhere I want to test for whitespace strings in all my future Python code. (I was bitten by exactly this issue in my code, which is what caused me to file the issue in the first place.)
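For readers following along, here is a minimal sketch of the rule and the workaround described above. It assumes Python 3 and the standard unicodedata module. The helper names describe and strip_with_zero_width are made up for illustration and are not part of Python. The loop is an addition to the one-line workaround, so that whitespace and zero-width characters interleaved at the edges are also removed.

```python
import unicodedata

# The two characters named in the report (an illustrative, not exhaustive, set).
ZERO_WIDTH = "\u200b\ufeff"  # ZERO WIDTH SPACE, ZERO WIDTH NO-BREAK SPACE


def describe(ch):
    """Return (general category, bidirectional class, str.isspace() result)."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch), ch.isspace()


def strip_with_zero_width(text):
    """Hypothetical helper: repeat the chained strip until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        # Same expression as the workaround in the message above.
        text = text.strip().strip(ZERO_WIDTH)
    return text


if __name__ == "__main__":
    for ch in (" ", "\t", "\u200b", "\ufeff"):
        print("U+%04X" % ord(ch), describe(ch))
    print(len(strip_with_zero_width(" \t\r\n\u200b")))
```

On a current Python 3, both zero-width characters report category Cf and bidirectional class BN, so neither satisfies the quoted rule. That is consistent with strip() leaving them in place.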
[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)
Dave Mankoff added the comment:

But why are they not a space? I mean, they literally have the word space in their name and are used as separators between words. I can't really see any reason why you wouldn't want this behavior: there is no time when I would be thankful that strip removed all spaces except for ZWSP and the like.

As to deprecation, yes, that is true, but they still exist and will continue to do so. (My issue arose when a 3rd party delivered an all-whitespace string to me.)

I can't really debate this further as there's not much more to say. I hope the issue will be reconsidered. Thanks again for taking the time to discuss.
[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)
Dave Mankoff added the comment:

So I contacted the Unicode Technical Committee about the issue and promptly received a response. They pointed out that the ZWSP was, once upon a time, considered white space, but that this was changed in Unicode 4.0.1:

http://www.unicode.org/review/resolved-pri.html#pri21

One particular comment worth noting: "... for historical reasons the general category is still Zs (Space Separator)".

Perhaps this ticket can be changed to a feature request? In addition to stripping out whitespace, it is useful to remove any non-printable characters from a string (or to know whether a string contains any non-printable characters). Perhaps a boolean keyword parameter, "control_chars", could be added to isspace and strip? Thus:

>>> u' \t\r\n\u200B'.isspace(control_chars=True)
True
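The proposed behavior can be approximated today with free functions. This is only a sketch, assuming Python 3. The names isspace_ex and strip_ex are hypothetical, since str.isspace and str.strip accept no such keyword. Reading "control_chars" as the Unicode general categories Cc (control) and Cf (format) is also an assumption, not something the proposal spells out.

```python
import unicodedata

# Assumed reading of "control chars": general categories Cc and Cf.
_CONTROL_CATEGORIES = ("Cc", "Cf")


def _is_blank(ch, control_chars):
    """True for whitespace, and for Cc/Cf characters when control_chars is set."""
    if ch.isspace():
        return True
    return control_chars and unicodedata.category(ch) in _CONTROL_CATEGORIES


def isspace_ex(text, control_chars=False):
    """Hypothetical stand-in for the proposed isspace(control_chars=...)."""
    # Like str.isspace(), an empty string is not considered whitespace.
    return bool(text) and all(_is_blank(ch, control_chars) for ch in text)


def strip_ex(text, control_chars=False):
    """Hypothetical stand-in for the proposed strip(control_chars=...)."""
    start, end = 0, len(text)
    while start < end and _is_blank(text[start], control_chars):
        start += 1
    while end > start and _is_blank(text[end - 1], control_chars):
        end -= 1
    return text[start:end]


if __name__ == "__main__":
    print(isspace_ex(" \t\r\n\u200b", control_chars=True))
    print(repr(strip_ex("\u200b a\u200bb \ufeff", control_chars=True)))
```

Note that category Cf also covers characters such as the soft hyphen and the bidirectional marks, so this reading is broader than just ZWSP and U+FEFF. Only the ends of the string are affected, and a zero-width character in the middle is left alone.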
[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)
Dave Mankoff added the comment:

"Use regular expressions for more advanced stripping than what the .strip method provides."

So I guess this brings me back to my original issue. I'm not looking for particularly advanced stripping. I just want to remove all whitespace and other non-printing characters. I personally cannot think of a time when I wouldn't want this (especially with isspace). Maybe in some applications the control characters are useful and shouldn't be stripped, but I would argue that _that_ is the more advanced use case for most people.

Thus strip and isspace are now unusable methods in Python for common use cases. This seems unfortunate. I can understand the claims of feature creep. I even understand that having isspace compare itself against non-whitespace characters may seem counter-intuitive on its face. But certainly there must be a satisfactory remedy here.
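For comparison, the regular-expression route quoted at the top of this message might look roughly like the following. It is a sketch assuming Python 3 and the standard re module. The helper name regex_strip is made up, and the character class lists only the two zero-width characters raised in this thread.

```python
import re

# \s matches Unicode whitespace for str patterns; U+200B and U+FEFF are
# added explicitly because \s does not cover them.
_BLANK = "[\\s\u200b\ufeff]"

# Match a run of blanks anchored at the very start or the very end.
_EDGES = re.compile("\\A{0}+|{0}+\\Z".format(_BLANK))


def regex_strip(text):
    """Hypothetical helper: strip whitespace and zero-width characters."""
    return _EDGES.sub("", text)


if __name__ == "__main__":
    print(len(regex_strip(" \t\r\n\u200b")))
    print(repr(regex_strip("\ufeff hello world \u200b\n")))
```

An all-blank test then becomes "not regex_strip(value)", at the cost of maintaining the character list by hand.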