[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)

2011-11-12 Thread Dave Mankoff

New submission from Dave Mankoff:

Title pretty much says it all. Simple test case:

>>> len(u' \t\r\n\u200B'.strip())
1

Should be zero.

Same problem in Python3:

>>> len(' \t\r\n\u200B'.strip())
1

--
components: Unicode
messages: 147538
nosy: ezio.melotti, mankyd
priority: normal
severity: normal
status: open
title: string.strip Does Not Remove Zero-Width-Space (ZWSP)
type: behavior
versions: Python 2.7, Python 3.2

___
Python tracker 
<http://bugs.python.org/issue13391>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)

2011-11-14 Thread Dave Mankoff

Dave Mankoff added the comment:

I appreciated the quick turnaround on this.

Perhaps I am misunderstanding the resolution. I understand that strip uses 
_PyUnicode_IsWhitespace, and that _PyUnicode_IsWhitespace "Returns 1 for 
Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the 
category 'Zs', 0 otherwise." However, perhaps this is where the functionality 
is missing?
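For anyone following along, the quoted rule can be approximated in pure Python with the standard unicodedata module. This is only a sketch of the documented rule, not CPython's actual C implementation, and the helper name is_whitespace_by_rule is made up for illustration.

```python
import unicodedata


def is_whitespace_by_rule(ch):
    """Approximate the quoted rule: bidirectional type WS, B or S,
    or general category Zs. (Hypothetical helper, not a CPython API.)"""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")


# Print the properties the rule looks at, next to what str.isspace() reports.
for ch in (" ", "\t", "\u00a0", "\u200b", "\ufeff"):
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),
          unicodedata.bidirectional(ch),
          is_whitespace_by_rule(ch),
          ch.isspace())
```

In the Unicode data shipped with current Python versions, U+200B and U+FEFF are both in category Cf with bidirectional type BN, so neither satisfies the rule.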

Upon further inspection, it looks like there may be other missing white-space 
characters, such as U+FEFF, "Zero Width No-Break Space". Whatever Unicode 
categories they're in, they're still a form of white-space and should still be 
removed, no?

This was not the behavior I expected from strip(). 
This affects string.isspace() as well.  I now have to put 
var.strip().strip(u'\u200B\ufeff') anywhere I want to test for whitespace 
strings in all my future Python code. (I was bitten by exactly this issue in my 
code, which is what caused me to file the issue in the first place.)
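The chained workaround above can be wrapped in a small helper. The sketch below is one possible shape, assuming Python 3; the names ZERO_WIDTH and strip_invisible are invented here and are not part of the standard library. It walks in from both ends instead of chaining two strip() calls, because the chained form stops early when ordinary whitespace and zero-width characters are interleaved.

```python
# Characters to treat as strippable in addition to what str.isspace() accepts.
# (Assumed set for illustration: ZWSP and the zero width no-break space.)
ZERO_WIDTH = "\u200b\ufeff"


def _is_strippable(ch):
    return ch.isspace() or ch in ZERO_WIDTH


def strip_invisible(text):
    """Strip leading and trailing whitespace and zero-width characters."""
    start, end = 0, len(text)
    while start < end and _is_strippable(text[start]):
        start += 1
    while end > start and _is_strippable(text[end - 1]):
        end -= 1
    return text[start:end]


print(len(strip_invisible(" \t\r\n\u200b")))
print(repr(strip_invisible("\u200b \ufeff hi \u200b ")))
```

A string is then "blank" in the intended sense when strip_invisible(var) is empty.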




[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)

2011-11-14 Thread Dave Mankoff

Dave Mankoff added the comment:

But why are they not a space? I mean, they literally have the word space in 
their name and are used as separators between words. I can't really see any 
reason why you wouldn't want this behavior - there's no time when I would be 
thankful that strip removed all spaces except for ZWSP and the like.

As to deprecation, yes, that is true, but they still exist and will continue to 
do so. (My issue arose when a 3rd party delivered an all whitespace string to 
me.)

I can't really debate this further as there's not much more to say. I hope the 
issue will be reconsidered. Thanks again for taking the time to discuss.

--




[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)

2011-11-14 Thread Dave Mankoff

Dave Mankoff added the comment:

So I contacted the Unicode Technical Committee about the issue and promptly 
received a response. They pointed out that the ZWSP was, once upon a time, 
considered white space, but that was changed in Unicode 4.0.1:

http://www.unicode.org/review/resolved-pri.html#pri21

One particular comment worth noting: "... for historical reasons the general 
category is still Zs (Space Separator)".

Perhaps this ticket can be changed to a feature request? In addition to 
stripping out whitespace, it is useful to remove any non-printable characters 
from a string (or know if a string contains any non-printable characters).

Perhaps a boolean keyword parameter, "control_chars", could be added to isspace 
and strip? Thus:

>>> u' \t\r\n\u200B'.isspace(control_chars=True)
True
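Since str.isspace() takes no arguments today, the proposed behavior can only be sketched as a standalone function. The version below is an illustration under assumptions: is_blank is a made-up name, and "control characters" is read here as Unicode general categories Cc and Cf, which is one possible interpretation of the proposal rather than anything agreed on in this ticket.

```python
import unicodedata


def is_blank(text, control_chars=False):
    """Return True if text is non-empty and contains only whitespace.

    With control_chars=True, characters in categories Cc and Cf
    (which includes U+200B and U+FEFF) are also accepted.
    Like str.isspace(), an empty string gives False.
    """
    if not text:
        return False
    for ch in text:
        if ch.isspace():
            continue
        if control_chars and unicodedata.category(ch) in ("Cc", "Cf"):
            continue
        return False
    return True


print(is_blank(" \t\r\n\u200b"))
print(is_blank(" \t\r\n\u200b", control_chars=True))
```

Keeping the flag off by default leaves the existing isspace semantics untouched for callers who rely on them.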

--




[issue13391] string.strip Does Not Remove Zero-Width-Space (ZWSP)

2011-11-15 Thread Dave Mankoff

Dave Mankoff added the comment:

"Use regular expressions for more advanced stripping than what the .strip 
method provides."

So I guess this brings me back to my original issue. I'm not looking for 
particularly advanced stripping. I just want to remove all whitespace and other 
non-printing characters. I personally can never think of a time when I wouldn't 
want this (especially with isspace). Maybe in some applications, the control 
characters are useful and shouldn't be stripped, but I would argue that _that_ 
is the more advanced use case for most people.
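For reference, the regular-expression route suggested in the quote might look roughly like the following in Python 3, where \s in a str pattern matches Unicode whitespace. The pattern and the name regex_strip are assumptions made for this sketch; the set of extra characters (U+200B, U+FEFF) is just the pair mentioned earlier in this thread.

```python
import re

# Leading or trailing runs of whitespace plus the two zero-width characters
# discussed in this issue. Extend the character class as needed.
_EDGE_INVISIBLE = re.compile(r"\A[\s\u200b\ufeff]+|[\s\u200b\ufeff]+\Z")


def regex_strip(text):
    """Strip whitespace and zero-width characters from both ends of text."""
    return _EDGE_INVISIBLE.sub("", text)


print(len(regex_strip(" \t\r\n\u200b")))
print(repr(regex_strip("\ufeff hi there\u200b\n")))
```

It works, but it is noticeably more machinery than a plain .strip() call, which is the point being argued above.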

Thus strip and isspace are now unusable methods in Python for common use cases. 
This seems unfortunate.

I can understand the claims of feature creep. I even understand that having 
isspace compare itself against non-whitespace characters may seem 
counter-intuitive on its face. But certainly there must be a satisfactory 
remedy here.
