Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/25/2009 5:22 AM, came the following characters from 
the keyboard of Martin v. Löwis:

The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.


Why is it necessary that you are able to make this distinction?



It is necessary that programs (not me) can make the distinction, so that 
they know whether or not to do the funny-encoding.  If a name is 
funny-decoded when the name is accessed by a directory listing, it needs 
to be funny-encoded in order to open the file.
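
The round trip described here can be sketched with the `surrogateescape` error handler, which is the form in which PEP 383 eventually shipped in Python 3.1. This is an illustration under that assumption, not the draft being discussed at this point in the thread, and the file name is made up.

```python
# Assumption: Python 3.1+, where PEP 383 landed as the "surrogateescape"
# error handler. The file name below is a made-up example.
raw = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8

# "Funny-decode": the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# "Funny-encode": the lone surrogate is turned back into the original byte,
# so a name obtained from a listing can be used to open the same file.
restored = name.encode("utf-8", "surrogateescape")

print(ascii(name))      # 'caf\udce9.txt'
print(restored == raw)  # True
```

In current Python, `os.fsdecode()` and `os.fsencode()` apply the same handler with the filesystem encoding.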




Picking a character (I don't find U+F01xx in the
Unicode standard, so I don't know what it is)


It's a private use area. It will never carry an official character
assignment.



I know that U+F - U+F is a private use area.  I don't find a 
definition of U+F01xx to know what the notation means.  Are you picking 
a particular character within the private use area, or a particular 
range, or what?




As I realized in the email-sig, in talking about decoding corrupted
headers, there is only one way to guarantee this... to encode _all_
character sequences, from _all_ interfaces.  Basically it requires
reserving an escape character (I'll use ? in these examples -- yes, an
ASCII question mark -- happens to be illegal in Windows filenames so
all the better on that platform, but the specific character doesn't
matter... avoiding / \ and . is probably good, though).


I think you'll have to write an alternative PEP if you want to see
something like this implemented throughout Python.



I'm certainly not experienced enough in Python development processes or 
internals to attempt such, as yet.  But somewhere in 25 years of 
programming, I picked up the knowledge that if you want to have a 1-to-1 
reversible mapping, you have to avoid data puns, mappings of two 
different data values into a single data value.  Your PEP, as first 
written, didn't seem to do that... since there are two interfaces from 
which to obtain data values, one performing a mapping from bytes to 
"funny invalid" Unicode, and the other performing no mapping, but 
accepting any sort of Unicode, possibly including "funny invalid" 
Unicode, the possibility of data puns seems to exist.  I may be 
misunderstanding something about the use cases that prevent these two 
sources of "funny invalid" Unicode from ever coexisting, but if so, 
perhaps you could point it out, or clarify the PEP.  I'll try to reread 
it again... could you post a URL to the most up-to-date version of the 
PEP, since I haven't seen such appear here, and the version I found via 
a Google search seems to be the original?
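
The data-pun concern can be probed with a small sketch. It assumes the `surrogateescape` handler that Python 3.1 later shipped and a UTF-8 file system encoding; the helper names `funny_decode` and `funny_encode` are hypothetical and simply reuse the vocabulary of this thread. It only checks the specific cases shown and does not prove that no pun exists.

```python
# Hypothetical helpers named after the thread's vocabulary.
def funny_decode(raw: bytes) -> str:
    return raw.decode("utf-8", "surrogateescape")

def funny_encode(name: str) -> bytes:
    return name.encode("utf-8", "surrogateescape")

undecodable = b"\xff"       # decodes to the lone surrogate U+DCFF
lax_utf8 = b"\xed\xb3\xbf"  # what a lax encoder would emit for U+DCFF itself

a = funny_decode(undecodable)
b = funny_decode(lax_utf8)

# The two byte strings stay distinct as str, and both round-trip.
assert a != b
assert funny_encode(a) == undecodable
assert funny_encode(b) == lax_utf8

# A str from some other API that holds a surrogate outside U+DC80..U+DCFF
# cannot be funny-encoded; the handler refuses instead of guessing.
try:
    funny_encode("\ud800")
except UnicodeEncodeError:
    refused = True
else:
    refused = False
print(refused)
```

What the sketch cannot settle is the question raised above: a str such as "\udcff" looks the same whether it came from a bytes API or was handed in by a str API.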



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 26Apr2009 23:39, Glenn Linderman  wrote:
[...snip...]
> There are still issues regarding how Windows and POSIX programs that are  
> sharing cross-mounted file systems might communicate file names between  
> each other, which is not at all clear from the PEP.  If this is an  
> insoluble or un-addressed issue, it should be stated.  (It is probably  
> insoluble, due to there being multiple ways that the cross-mounted file  
> systems might translate names; but if there are, can we learn something  
> from the rules the mounting systems use, to be compatible with (one of)  
> them, or not?)

I'd say that's out of scope. A Windows filesystem mounted on a UNIX host
should probably be mounted with a mapping to translate the Windows
Unicode names into whatever the sysadmin deems the locally most apt
byte encoding. But sys.getfilesystemencoding() is based on the current user's
locale settings, which need not be the same.

> Your change to avoid using PUA characters, together with the rule
> suggested by MRAB in another branch of this thread of treating
> half-surrogates as invalid byte sequences, may avoid the data puns I'm
> concerned about.
>
> It is not clear how half-surrogate characters would be displayed, when  
> the user prints or displays such a file name string.  It would seem that  
> programs that display file names to users might still have issues with  
> such; an escaping mechanism that uses displayable characters would have  
> an advantage there.

Wouldn't any escaping mechanism that uses displayable characters
require visually mangling occurrences of those characters that
legitimately occur in the original?
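
A minimal sketch of why that is so, assuming a made-up scheme with ? as the escape character: undecodable bytes are written as ? plus two hex digits, so a literal ? has to be doubled to stay unambiguous. The names `ESC`, `escape_name` and `unescape_name` are hypothetical and not part of any proposal in this thread.

```python
ESC = "?"  # hypothetical escape character, as in the examples in this thread

def escape_name(raw: bytes) -> str:
    """Render a byte file name using only displayable characters."""
    text = raw.decode("utf-8", "surrogateescape")
    out = []
    for ch in text:
        if ch == ESC:
            out.append(ESC + ESC)  # a literal "?" must be doubled
        elif 0xDC80 <= ord(ch) <= 0xDCFF:
            # an undecodable byte becomes "?" plus two hex digits
            out.append("%s%02x" % (ESC, ord(ch) - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)

def unescape_name(text: str) -> bytes:
    """Invert escape_name(); only meant for strings it produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != ESC:
            out += text[i].encode("utf-8")
            i += 1
        elif text[i + 1:i + 2] == ESC:
            out += b"?"
            i += 2
        else:
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
    return bytes(out)

print(escape_name(b"what?.txt"))    # what??.txt
print(escape_name(b"caf\xe9.txt"))  # caf?e9.txt
```

The first line of output is the visual mangling in question: a perfectly valid POSIX name is shown with an extra ?.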
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 12:55 AM, came the following characters from 
the keyboard of Cameron Simpson:

On 26Apr2009 23:39, Glenn Linderman  wrote:
[...snip...]
  
There are still issues regarding how Windows and POSIX programs that are  
sharing cross-mounted file systems might communicate file names between  
each other, which is not at all clear from the PEP.  If this is an  
insoluble or un-addressed issue, it should be stated.  (It is probably  
insoluble, due to there being multiple ways that the cross-mounted file  
systems might translate names; but if there are, can we learn something  
from the rules the mounting systems use, to be compatible with (one of)  
them, or not?)



I'd say that's out of scope. A Windows filesystem mounted on a UNIX host
should probably be mounted with a mapping to translate the Windows
Unicode names into whatever the sysadmin deems the locally most apt
byte encoding. But sys.getfilesystemencoding() is based on the current user's
locale settings, which need not be the same.
  


And if it were, what would it do with files that can't be encoded with 
the locally most apt byte encoding?  That's where we might learn 
something about what behaviors are deemed acceptable.  Would such files 
be inaccessible?  Accessible with mangled names?  or what?


And for a Unix filesystem mounted on a Windows host?  Or accessed via 
some network connection?



Your change to avoid using PUA characters, together with the rule
suggested by MRAB in another branch of this thread of treating
half-surrogates as invalid byte sequences, may avoid the data puns I'm
concerned about.


It is not clear how half-surrogate characters would be displayed, when  
the user prints or displays such a file name string.  It would seem that  
programs that display file names to users might still have issues with  
such; an escaping mechanism that uses displayable characters would have  
an advantage there.



Wouldn't any escaping mechanism that uses displayable characters
require visually mangling occurrences of those characters that
legitimately occur in the original?
  


Yes.  My suggested use of ? is a visible character that is illegal in 
Windows file names, thus causing no valid Windows file names to be 
visually mangled.  It is also a character that should be avoided in 
POSIX names because:


1) it is known to be illegal on Windows, and thus non-portable
2) it is hard to write globs that match ? without allowing matches of 
other characters as well

3) it must be quoted to specify it on a command line

That said, someone provided a case where it is "easy" to get ? in POSIX 
file names.  The remaining question is whether that is a reasonable use 
case, a frequent use case, or a stupid use case; and whether the 
resulting visible mangling is more or less understandable and disruptive 
than using half-surrogates which are:


1) invalid Unicode
2) non-displayable
3) indistinguishable using normal non-displayable character substitution 
rules


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread R. David Murray

On Mon, 27 Apr 2009 at 01:40, Glenn Linderman wrote:
Yes.  My suggested use of ? is a visible character that is illegal in Windows 
file names, thus causing no valid Windows file names to be visually mangled. 
It is also a character that should be avoided in POSIX names because:


1) it is known to be illegal on Windows, and thus non-portable
2) it is hard to write globs that match ? without allowing matches of other 
characters as well

3) it must be quoted to specify it on a command line

That said, someone provided a case where it is "easy" to get ? in POSIX file 
names.  The remaining question is whether that is a reasonable use case, a 
frequent use case, or a stupid use case; and whether the resulting visible


Reasonable I don't know, but frequent (FSDO frequent) and out of
our control yes.  It happens often when downloading files with wget,
for example.

--David


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Antoine Pitrou
Stephen J. Turnbull  xemacs.org> writes:
> 
> If
> you see a broken encoding once, you're likely to see it a million times
> (spammers have the most broken software) or maybe have it raise an
> unhandled Exception a dozen times (in rate of using busted software,
> the spammers are closely followed by bosses---which would be very bad,
> eh, if 2/3 of the mail from your boss ends up in an undeliverables
> queue due to encoding errors that are unhandled by some filter in
> your mail pipeline).

I'm not sure how mail being stuck in a pipeline has anything to do with Martin's
proposal (which deals with file paths, not with SMTP...).
Besides, I don't care about spammers and their broken software.

> Again, that's not the point.  The point is that six-sigma reliability
> world-wide is not going to be very comforting to the poor souls who
> happen to have broken software in their environment sending broken
> encodings regularly, because they're going to be dealing with one or
> two sigmas, and that's just not good enough in a production
> environment.

So you're arguing that whatever solution which isn't 100% perfect but only
99.999% perfect shouldn't be implemented at all, and leave the status quo at
98%? This sounds disturbing to me.

(especially given you probably sent this mail using TCP/IP...)

Regards

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen

Hello,

Antoine Pitrou  writes:
> Hello,
>
> We're in the process of forward-porting the recent (massive) json
> updates to 3.1, and we are also thinking of dropping remnants of
> support of the bytes type in the json library (in 3.1, again). This
> bytes support almost didn't work at all, but there was a lot of C and
> Python code for it nevertheless. We're also thinking of dropping the
> "encoding" argument in the various APIs, since it is useless.

I had a quick look into the module on both branches, and at Antoine's
latest patch (json_py3k-3).  The current situation on trunk is indeed
not very pretty in terms of code duplication, and I agree it would be
nice not to carry that forward.

I couldn't figure out a way to get rid of it short of multi-#including
"templates" and playing with the C preprocessor, however, and have the
nagging feeling the latter would be frowned upon by the maintainers.

There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm
wrong about that.  Should I give it a try, and see how "clean" the
result can be made?

> Under the new situation, json would only ever allow str as input, and
> output str as well. By posting here, I want to know whether anybody
> would oppose this (knowing, once again, that bytes support is already
> broken in the current py3k trunk).

Provided one of the alternatives is dropped, wouldn't it be better to do
the opposite, i.e., have the decoder take bytes as input, and the
encoder produce bytes—and layer the str functionality on top of that?  I
guess the answer depends on how the (most common) lower layers are
structured, but it would be nice to allow a straight bytes path to/from
the underlying transport.

(I'm willing to have a go at the conversion in case somebody is
interested.)

Bob, would you have an idea of which lower layers are most commonly used
with the json module, and whether people are more likely to expect strs
or bytes in Python 3.x?  Maybe that data could be inferred from some bug
tracking system?

> The bug entry is: http://bugs.python.org/issue4136
>
> Regards
> Antoine.

Regards,
Damien

-- 
http://crosstwine.com

"Strong Opinions, Weakly Held"
 -- Bob Johansen


[Python-Dev] Windows buildbots failing test_types in trunk

2009-04-27 Thread Eric Smith
Mark Dickinson pointed out to me that the trunk buildbots are failing 
under Windows.


After some analysis, I think this is because of a change I made to use 
_toupper in integer formatting. The correct solution to this is to 
implement issue 5793 to come up with a working, cross-platform, 
locale-unaware set of functions and/or macros for isdigit / isupper / 
toupper, etc.
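
The idea of locale-unaware character functions can be sketched as follows. The real fix would be C functions or macros; Python is used here only for illustration, and the names `ascii_isdigit`, `ascii_isupper` and `ascii_toupper` are hypothetical, not taken from issue 5793.

```python
# Hypothetical ASCII-only helpers: they look at the byte value alone, so the
# process locale cannot change the result.

def ascii_isdigit(c: int) -> bool:
    return 0x30 <= c <= 0x39  # '0'..'9'

def ascii_isupper(c: int) -> bool:
    return 0x41 <= c <= 0x5A  # 'A'..'Z'

def ascii_toupper(c: int) -> int:
    # Only 'a'..'z' are shifted; every other byte value is returned unchanged.
    return c - 0x20 if 0x61 <= c <= 0x7A else c

print(chr(ascii_toupper(ord("x"))))  # X
print(hex(ascii_toupper(0xE9)))      # 0xe9
```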


I'll work on this tonight or tomorrow, at which point the Windows 
buildbots should turn green.


I don't think this affects py3k, although I'll port it there before the 
beta release.


Eric.


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Eric Smith
> I couldn't figure out a way to get rid of it short of multi-#including
> "templates" and playing with the C preprocessor, however, and have the
> nagging feeling the latter would be frowned upon by the maintainers.

Not sure if this is exactly what you mean, but look at Objects/stringlib.
str.format() and unicode.format() share the same implementation, using
stringdefs.h and unicodedefs.h.

Eric.



Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Bob Ippolito
On Mon, Apr 27, 2009 at 7:25 AM, Damien Diederen  wrote:
>
> Antoine Pitrou  writes:
>> Hello,
>>
>> We're in the process of forward-porting the recent (massive) json
>> updates to 3.1, and we are also thinking of dropping remnants of
>> support of the bytes type in the json library (in 3.1, again). This
>> bytes support almost didn't work at all, but there was a lot of C and
>> Python code for it nevertheless. We're also thinking of dropping the
>> "encoding" argument in the various APIs, since it is useless.
>
> I had a quick look into the module on both branches, and at Antoine's
> latest patch (json_py3k-3).  The current situation on trunk is indeed
> not very pretty in terms of code duplication, and I agree it would be
> nice not to carry that forward.
>
> I couldn't figure out a way to get rid of it short of multi-#including
> "templates" and playing with the C preprocessor, however, and have the
> nagging feeling the latter would be frowned upon by the maintainers.
>
> There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm
> wrong about that.  Should I give it a try, and see how "clean" the
> result can be made?
>
>> Under the new situation, json would only ever allow str as input, and
>> output str as well. By posting here, I want to know whether anybody
>> would oppose this (knowing, once again, that bytes support is already
>> broken in the current py3k trunk).
>
> Provided one of the alternatives is dropped, wouldn't it be better to do
> the opposite, i.e., have the decoder take bytes as input, and the
> encoder produce bytes—and layer the str functionality on top of that?  I
> guess the answer depends on how the (most common) lower layers are
> structured, but it would be nice to allow a straight bytes path to/from
> the underlying transport.
>
> (I'm willing to have a go at the conversion in case somebody is
> interested.)
>
> Bob, would you have an idea of which lower layers are most commonly used
> with the json module, and whether people are more likely to expect strs
> or bytes in Python 3.x?  Maybe that data could be inferred from some bug
> tracking system?

I don't know what Python 3.x users expect. As far as I know, none of
the lower layers of the json package are used directly. They're
certainly not supposed to be or documented as such.

My use case for dumps is typically bytes output because we push it
straight to and from IO. Some people embed JSON in other documents
(e.g. HTML) where you would want it to be text. I'm pretty sure that
the IO case is more common.

-bob


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen

Hi Eric,

"Eric Smith"  writes:
>> I couldn't figure out a way to get rid of it short of multi-#including
>> "templates" and playing with the C preprocessor, however, and have the
>> nagging feeling the latter would be frowned upon by the maintainers.
>
> Not sure if this is exactly what you mean, but look at Objects/stringlib.
> str.format() and unicode.format() share the same implementation, using
> stringdefs.h and unicodedefs.h.

That's indeed a much better example!  I'm more comfortable applying the
same technique to the json module now that I see it used in the core.

(Provided Bob and Antoine are not turned away by the relative ugliness,
that is.)

> Eric.

Cheers,
Damien

--
http://crosstwine.com

"Strong Opinions, Weakly Held"
 -- Bob Johansen


Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Antoine Pitrou
Damien Diederen  crosstwine.com> writes:
> 
> I couldn't figure out a way to get rid of it short of multi-#including
> "templates" and playing with the C preprocessor, however, and have the
> nagging feeling the latter would be frowned upon by the maintainers.
> 
> There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm
> wrong about that.  Should I give it a try, and see how "clean" the
> result can be made?

Keep in mind that json is externally maintained by Bob. The more we rework his
code, the less easy it will be to backport other changes from the simplejson
library.

I think we should either keep the code duplication (if we want to keep fast
paths for both bytes and str objects), or only keep one of the two versions as
my patch does.

> Provided one of the alternatives is dropped, wouldn't it be better to do
> the opposite, i.e., have the decoder take bytes as input, and the
> encoder produce bytes—and layer the str functionality on top of that?  I
> guess the answer depends on how the (most common) lower layers are
> structured, but it would be nice to allow a straight bytes path to/from
> the underlying transport.

The straightest path is actually to/from unicode, since JSON data can contain
unicode strings but no byte strings. Also, the json library /has/ to output
unicode when `ensure_ascii` is False. In 2.x:

>>> json.dumps([u"éléphant"], ensure_ascii=False)
u'["\xe9l\xe9phant"]'

In any case, I don't think it will matter much in terms of speed whether we take
one route or the other. UTF-8 encoding/decoding is probably much faster (in
characters per second) than JSON encoding/decoding is.
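
For comparison, a sketch of the same call on Python 3, where `json.dumps` returns str in both modes and any bytes are produced by an explicit encoding step afterwards. The sample data is taken from the 2.x example above; the variable names are only illustrative.

```python
import json

data = ["éléphant"]

escaped = json.dumps(data)                      # default: ensure_ascii=True
literal = json.dumps(data, ensure_ascii=False)  # non-ASCII characters kept as-is

print(escaped)  # ["\u00e9l\u00e9phant"]
print(literal)  # ["éléphant"]

# Going to and from a byte transport is a separate, explicit UTF-8 step.
wire = literal.encode("utf-8")
assert json.loads(wire.decode("utf-8")) == data
```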

Regards

Antoine.




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Antoine Pitrou writes:

 > I'm not sure how mail being stuck in a pipeline has anything to do
 > with Martin's proposal (which deals with file paths, not with
 > SMTP...).

I hate to break it to you, but most stages of mail processing have
very little to do with SMTP.  In particular, processing MIME
attachments often requires dealing with file names.  Would practical
problems arise?  I expect they would.  Can I tell you what they are?
No; if I could I'd write a better PEP.  I'm just saying that my
experience is that Murphy's Law applies more to encoding processing
than any other area of software I've worked in (admittedly, I don't do
threads ;-).

 > Besides, I don't care about spammers and their broken software.

That's precisely my point.  The PEP's "solution" will be very
appealing to people who just don't care as long as it works for them,
in the subset of corner cases they happen to encounter.  A lot of
software, including low-level components, will be written using these
APIs, and they will result in escapes of uninterpreted bytes (encoded
as Unicode) into the textual world.

 > So you're arguing that whatever solution which isn't 100% perfect
 > but only 99.999% perfect shouldn't be implemented at all, and leave
 > the status quo at 98%?

No, I'm not talking about "whatever solution".  I'm only arguing about
PEP 383.  The point is that Martin's proposal is not just a solution
to the problem he posed.  It's also going to be the one obvious way to
make the usual mistakes, i.e., the return values will escape into code
paths they're not intended for.  And the APIs won't be killable until
Python 4000.  If we find a better way (which I think Python 3's move
to "text is Unicode" is likely to inspire!), we'll have to wait 10-15
years or more before it becomes the OOWTDI.  The only real hope about
that is that Unicode will become universal before that, and only
archaeologists will ever encounter malformed text.

I believe there are solutions that don't have that problem.
Specifically, if the return values were bytes, or (better for 2.x,
where bytes are strings as far as most programmers are concerned) as a
new data type, to indicate that they're not text until the client
acknowledges them as such.  EIBTI.
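
One way to read "bytes until the client acknowledges them" is sketched below with existing calls only: `os.listdir()` returns bytes names when given a bytes path, and the caller decodes strictly. The helper name `list_names_explicit` and the choice of UTF-8 are assumptions made for illustration, not something proposed in this thread.

```python
import os
import tempfile

def list_names_explicit(path, encoding="utf-8"):
    """Return (decoded, undecodable) names; nothing becomes str implicitly."""
    decoded, undecodable = [], []
    for raw in os.listdir(os.fsencode(path)):  # bytes path in, bytes names out
        try:
            decoded.append(raw.decode(encoding))  # strict: the caller opts in
        except UnicodeDecodeError:
            undecodable.append(raw)               # stays bytes, never pretends to be text
    return decoded, undecodable

# Usage with a throwaway directory and one well-formed name.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "report.txt"), "w").close()
    print(list_names_explicit(tmp))
```

The cost is visible in the return type: every caller has to decide what to do with the second list.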

Unfortunately, Martin clearly doesn't intend to make such a change to
the PEP.  I don't have the time or the Python expertise to generate an
alternative PEP. :-(  I do have long experience with the pain of
dealing with encoding issues caused by APIs that are intended to DTRT,
conveniently.  Martin's is better than most, but I just don't think
convenience and robustness can be combined in this area.

 > This sounds disturbing to me.

BTW, I'm on record as +0 on the PEP.  I don't think the better
proposals have a chance, because most people *want* the non-solution
that they can just use as a habit, allowing Python to make decisions
that should be made by the application, and not have to do
"unnecessary" conversions and the like.  It's not obvious to me that
it should not be given to them, but I don't much like it.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Paul Moore
2009/4/27 Stephen J. Turnbull :
> I believe there are solutions that don't have that problem.
> Specifically, if the return values were bytes, or (better for 2.x,
> where bytes are strings as far as most programmers are concerned) as a
> new data type, to indicate that they're not text until the client
> acknowledges them as such.  EIBTI.

I think you're ignoring the fact that under Windows, it's the *bytes*
APIs that are lossy.
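
A rough illustration of that lossiness, using Python's own `cp1252` codec as a stand-in for a Windows ANSI code page. This is an assumption for the sake of the example; the actual Windows conversion may behave differently in detail, and the file name is made up.

```python
# Windows file names are Unicode; squeezing one through a single-byte code
# page drops whatever that code page cannot represent.
name = "Ωmega.txt"                        # made-up name with a Greek capital omega
lossy = name.encode("cp1252", "replace")  # cp1252 has no Greek letters

print(lossy)  # b'?mega.txt'

# Decoding the bytes again does not bring the original name back.
assert lossy.decode("cp1252") != name
```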

Can I at least assume that you aren't recommending that only the bytes
API exists on Unix, and only the Unicode API on Windows?

So what's your suggestion?

> Unfortunately, Martin clearly doesn't intend to make such a change to
> the PEP.  I don't have the time or the Python expertise to generate an
> alternative PEP. :-(  I do have long experience with the pain of
> dealing with encoding issues caused by APIs that are intended to DTRT,
> conveniently.  Martin's is better than most, but I just don't think
> convenience and robustness can be combined in this area.

The *only* "robust" solution is to completely separate the 2
platforms. Which helps no-one, and is at least as bad as the 2.x
situation. (Probably worse).

> BTW, I'm on record as +0 on the PEP.  I don't think the better
> proposals have a chance, because most people *want* the non-solution
> that they can just use as a habit, allowing Python to make decisions
> that should be made by the application, and not have to do
> "unnecessary" conversions and the like.  It's not obvious to me that
> it should not be given to them, but I don't much like it.

People *want* a solution that doesn't require every application
developer to sweat blood to write working code, simply to cover corner
cases that they don't believe will happen. Not every application is a
24x7 server, and all that. Similarly, not every application is a
backup program. Such applications have unique issues, which the
developers should (but don't always, admittedly!) understand. The rest
of us don't want to be made to care.

It's not sloppiness. It's a realistic appreciation of the requirements
of the application. (And an acceptance that not every bug must be
fixed before release).

Paul.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Aahz
On Mon, Apr 27, 2009, Antoine Pitrou wrote:
> Stephen J. Turnbull  xemacs.org> writes:
>> 
>> If
>> you see a broken encoding once, you're likely to see it a million times
>> (spammers have the most broken software) or maybe have it raise an
>> unhandled Exception a dozen times (in rate of using busted software,
>> the spammers are closely followed by bosses---which would be very bad,
>> eh, if 2/3 of the mail from your boss ends up in an undeliverables
>> queue due to encoding errors that are unhandled by some filter in
>> your mail pipeline).
> 
> Besides, I don't care about spammers and their broken software.

Maybe you don't, but anyone who has to process random messages does; you
have to assume that messages will be broken.
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur."  --Red Adair


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Antoine Pitrou
Stephen J. Turnbull  xemacs.org> writes:
> 
> I hate to break it to you, but most stages of mail processing have
> very little to do with SMTP.  In particular, processing MIME
> attachments often requires dealing with file names.

AFAIK, the file name is only there as an indication for the user when he wants
to save the file. If it's garbled a bit, no big deal.

> The point is that Martin's proposal is not just a solution
> to the problem he posed.

But you haven't concretely demonstrated it with actual use cases. The problems
that the PEP tries to solve, conversely, /have/ been experienced.

> And the APIs won't be killable until
> Python 4000.

Which APIs? The PEP doesn't propose any new API, it just enhances the
implementation of current APIs so that they work out of the box in all cases.

> Specifically, if the return values were bytes,

... it would make Windows support worse.

> or (better for 2.x,
> where bytes are strings as far as most programmers are concerned) as a
> new data type,

I'm -1 on any new string-like type (for file paths or whatever else) with custom
encoding/decoding semantics. It's the best way to ruin the clean str/bytes
separation that 3.x introduced.

Besides, the goal is also to make things easier for the programmer. Otherwise,
we'll have the same situation as in 2.x where many English-centric programmers
produced code that was incapable of dealing with non-ASCII input, because they
didn't care about the distinction between str and unicode.

Regards

Antoine.




Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen

Hi Antoine,

Antoine Pitrou  writes:
> Damien Diederen  crosstwine.com> writes:
>> I couldn't figure out a way to get rid of it short of multi-#including
>> "templates" and playing with the C preprocessor, however, and have the
>> nagging feeling the latter would be frowned upon by the maintainers.
>> 
>> There is a precedent with xmltok.c/xmltok_impl.c, though, so maybe I'm
>> wrong about that.  Should I give it a try, and see how "clean" the
>> result can be made?
>
> Keep in mind that json is externally maintained by Bob. The more we rework his
> code, the less easy it will be to backport other changes from the simplejson
> library.
>
> I think we should either keep the code duplication (if we want to keep fast
> paths for both bytes and str objects), or only keep one of the two versions as
> my patch does.

Yes, I was (slowly) reaching the same conclusion.

>> Provided one of the alternatives is dropped, wouldn't it be better to do
>> the opposite, i.e., have the decoder take bytes as input, and the
>> encoder produce bytes—and layer the str functionality on top of that?  I
>> guess the answer depends on how the (most common) lower layers are
>> structured, but it would be nice to allow a straight bytes path to/from
>> the underlying transport.
>
> The straightest path is actually to/from unicode, since JSON data can contain
> unicode strings but no byte strings. Also, the json library /has/ to output
> unicode when `ensure_ascii` is False. In 2.x:
>
> >>> json.dumps([u"éléphant"], ensure_ascii=False)
> u'["\xe9l\xe9phant"]'
>
> In any case, I don't think it will matter much in terms of speed
> whether we take one route or the other. UTF-8 encoding/decoding is
> probably much faster (in characters per second) than JSON
> encoding/decoding is.

You're undoubtedly right.  I was more concerned about the interaction
with other modules, and avoiding unnecessary copies/conversions
especially when they don't make sense from the user's perspective.

I will whip up a patch adding a {loadb,dumpb} API as you suggested in
another email, with the most trivial implementation, and then we'll see
where to go from there.

It can still be dropped if there is a concern of perpetuating a "bad
idea," or I can follow up with a port of Bob's "bytes" implementation
from 2.x if there is any interest.
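
As a rough illustration of the "most trivial implementation" mentioned above, a bytes API can simply be layered on top of the str one. This is only a sketch: the names dumpb and loadb come from the thread, but the signatures and the UTF-8 default are assumptions, not the actual patch.

```python
import json


def dumpb(obj, encoding="utf-8", **kwargs):
    """Serialize obj to JSON text, then encode the text (assumed design)."""
    return json.dumps(obj, **kwargs).encode(encoding)


def loadb(data, encoding="utf-8", **kwargs):
    """Decode the bytes to text, then parse the JSON text (assumed design)."""
    return json.loads(data.decode(encoding), **kwargs)


payload = dumpb(["éléphant"], ensure_ascii=False)
print(payload)         # UTF-8 encoded bytes
print(loadb(payload))  # back to a list containing a str
```

With this layering the str path stays the primary one, and the bytes path costs one extra encode or decode step.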

> Regards
> Antoine.

Cheers,
Damien

-- 
http://crosstwine.com

"Strong Opinions, Weakly Held"
 -- Bob Johansen


[Python-Dev] 2.6.2 Vista installer failure on upgrade from 2.6.1

2009-04-27 Thread Jim Kleckner
I went to upgrade a Vista machine from 2.6.1 to 2.6.2 and got error 2755 
with the message "system cannot open the device or file".


I uninstalled 2.6.1, removing all residual files also, and got the error 
message again.


When I ran msiexec as follows to get a log, it magically worked:
 msiexec /i python-2.6.2.msi  /l*v install.log

Should I attempt to explore this further or just be happy?



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Paul Moore writes:
 > 2009/4/27 Stephen J. Turnbull :
 > > I believe there are solutions that don't have that problem.
 > > Specifically, if the return values were bytes, or (better for 2.x,
 > > where bytes are strings as far as most programmers are concerned) as a
 > > new data type, to indicate that they're not text until the client
 > > acknowledges them as such.  EIBTI.
 > 
 > I think you're ignoring the fact that under Windows, it's the *bytes*
 > APIs that are lossy.

The *Windows* bytes APIs may be lossy.  Python's bytes on the other
hand can represent anything that UTF-16 can.  Just represented as
UTF-8.  The point is that in Python 3 "bytes" means it's *your*
responsibility, not Python's, to decode that data.  The advantage of a
new data type is that Python can provide ways to do it and hide the
internal representation (in theory, it could even be different for the
different platforms).

 > Can I at least assume that you aren't recommending that only the bytes
 > API exists on Unix, and only the Unicode API on Windows?

I'm agnostic about the underlying APIs used to talk to the OS; people
who actually use that OS should decide that.  I'm just recommending
that the return values of the getters not be of a "character string"
type until converted explicitly by the application.
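
To make the idea of "not text until the client acknowledges it" concrete, here is a purely hypothetical sketch. The class name RawName and its methods are invented for illustration and do not correspond to any proposed or existing API.

```python
class RawName:
    """Hypothetical wrapper: an OS-supplied name that is not yet text."""

    __slots__ = ("_raw",)

    def __init__(self, raw):
        self._raw = bytes(raw)

    def as_bytes(self):
        """Return the name exactly as the OS supplied it."""
        return self._raw

    def as_text(self, encoding="utf-8", errors="strict"):
        """Explicit conversion step: the caller chooses codec and error policy."""
        return self._raw.decode(encoding, errors)

    def __repr__(self):
        return "RawName(%r)" % (self._raw,)


good = RawName(b"caf\xc3\xa9")
bad = RawName(b"\xff")
print(good.as_text())                 # decodes cleanly
print(bad.as_text(errors="replace"))  # caller opted into a lossy policy
```

The point of such a type would be that a decoding failure surfaces at the explicit as_text() call rather than somewhere later in string handling.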

 > The *only* "robust" solution is to completely separate the 2
 > platforms.

I'm not so pessimistic, unless you're referring to Microsoft's
penchant for forking any solution they don't own.

 > People *want* a solution that doesn't require every application
 > developer to sweat blood to write working code, simply to cover
 > corner cases that they don't believe will happen.  The rest of us
 > don't want to be made to care.

Well, yes, I wrote pretty much the same thing in the post you're
replying to.  But do you really think PEP 383 as written is the unique
solution to those requirements?


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System C haracter Interfaces

2009-04-27 Thread Stephen J. Turnbull
Antoine Pitrou writes:

 > > or (better for 2.x, where bytes are strings as far as most
 > > programmers are concerned) as a new data type,
 > 
 > I'm -1 on any new string-like type (for file paths or whatever
 > else) with custom encoding/decoding semantics. It's the best way to
 > ruin the clean str/bytes separation that 3.x introduced.

Excuse me, but I can't see a scheme that encodes bytes as Unicodes but
only sometimes as a "clean separation".  It's a dirty hack that makes
life a lot easier for Windows programmers and a little easier for many
Unix programmers.  Practicality beats purity, true, but at the cost of
the purity.

 > Besides, the goal is also to make things easier for the
 > programmer. Otherwise, we'll have the same situation as in 2.x
 > where many English-centric programmers produced code that was
 > incapable of dealing with non-ASCII input, because they didn't care
 > about the distinction between str and unicode.

So what you'll get here, AFAICS, is a new situation where many
Windows-centric programmers will produce code that's incapable of
dealing with non-Unicode input because they don't have to care about
the distinction between Unicode and bytes.

That's an improvement, but we can do still better and not at huge
expense to programmers.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Tony Nelson
At 23:39 -0700 04/26/2009, Glenn Linderman wrote:
>On approximately 4/25/2009 5:35 AM, came the following characters from
>the keyboard of Martin v. Löwis:
>>> Because the encoding is not reliably reversible.
>>
>> Why do you say that? The encoding is completely reversible
>> (unless we disagree on what "reversible" means).
>>
>>> I'm +1 on the concept, -1 on the PEP, due solely to the lack of a
>>> reversible encoding.
>>
>> Then please provide an example for a setup where it is not reversible.
>>
>> Regards,
>> Martin
>
>It is reversible if you know that it is decoded, and apply the encoding.
>  But if you don't know that it has been encoded, then applying the reverse
>transform can convert an undecoded str that matches the decoded str to
>the form that it could have, but never did take.
>
>The problem is that there is no guarantee that the str interface
>provides only strictly conforming Unicode, so decoding bytes to
>non-strictly conforming Unicode, can result in a data pun between
>non-strictly conforming Unicode coming from the str interface vs bytes
>being decoded to non-strictly conforming Unicode coming from the bytes
>interface.
 ...

Maybe this is a dumb idea, but some people might be reassured if the
half-surrogates had some particular pattern that is unlikely to occur even
in unreasonable text (as half-surrogates are an error in Unicode).  The
pattern could be some sequence of half-surrogate encoded bytes, framing the
intended data, as is done for RFC 2047 internationalized header fields in
email.  It would take up a few more bytes in the string, but no matter.  It
would also make it easier to diagnose when decoding was not properly done.

FWIW, I like the idea in the PEP, now that I think I understand it.

(BTW, gotta love what the email package is doing to the Subject: header
field. ;-')
-- 

TonyN.:'   
  '  


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System C haracter Interfaces

2009-04-27 Thread Tony Nelson
At 16:09 + 04/27/2009, Antoine Pitrou wrote:
>Stephen J. Turnbull writes:
>>
>> I hate to break it to you, but most stages of mail processing have
>> very little to do with SMTP.  In particular, processing MIME
>> attachments often requires dealing with file names.
>
>AFAIK, the file name is only there as an indication for the user when he wants
>to save the file. If it's garbled a bit, no big deal.
 ...

Yep.  In fact, it should be cleaned carefully.  RFC 2183, 2.3:

"It is important that the receiving MUA not blindly use the suggested
filename.  The suggested filename SHOULD be checked (and possibly
changed) to see that it conforms to local filesystem conventions,
does not overwrite an existing file, and does not present a security
problem (see Security Considerations below).

The receiving MUA SHOULD NOT respect any directory path information
that may seem to be present in the filename parameter.  The filename
should be treated as a terminal component only.  Portable
specification of directory paths might possibly be done in the future
via a separate Content Disposition parmeter, but no provision is
made for it in this draft."
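
A minimal sketch of the kind of cleaning the quoted text asks for, under stated assumptions: the helper name safe_attachment_name and the fallback name are invented here, and the sketch only covers the "terminal component only" part. Checking for an existing file and for local filesystem conventions is still left to the caller.

```python
import posixpath


def safe_attachment_name(suggested, default="attachment"):
    """Reduce a suggested filename to a bare terminal component (illustrative)."""
    # Treat both "/" and "\" as separators, then keep only the last component.
    name = posixpath.basename(suggested.replace("\\", "/"))
    # Drop control characters such as NUL or newlines.
    name = "".join(ch for ch in name if ch.isprintable()).strip()
    # Refuse names that are empty or that refer to a directory.
    if name in ("", ".", ".."):
        return default
    return name


print(safe_attachment_name("../../etc/passwd"))
print(safe_attachment_name("C:\\Users\\x\\report.pdf"))
```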

-- 

TonyN.:'   
  '  


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Antoine Pitrou
Stephen J. Turnbull writes:
> 
> Excuse me, but I can't see a scheme that encodes bytes as Unicodes but
> only sometimes as a "clean separation".

Yet it is. Filenames are all unicode, without exception, and there's no implicit
conversion to bytes. That's a clean separation.

> So what you'll get here, AFAICS, is a new situation where many
> Windows-centric programmers will produce code that's incapable of
> dealing with non-Unicode input because they don't have to care about
> the distinction between Unicode and bytes.

I don't understand what you're saying. py3k filenames are all unicode, even on
POSIX systems, so where is the problem with/for Windows programmers?




Re: [Python-Dev] UTF-8 Decoder

2009-04-27 Thread Jeroen Ruigrok van der Werven
-On [20090414 16:43], Antoine Pitrou (solip...@pitrou.net) wrote:
>If you have some time on your hands, you could try benchmarking it against
>Python 3.1's (py3k) decoder. There are two cases to consider:

Bjoern actually did it himself already:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#performance

(results are Large, Medium, Tiny)

PyUnicode_DecodeUTF8Stateful (3.1a2), Visual C++ 7.1 -Ox -Ot -G7
4523ms  5686ms  3138ms

Manually inlined transcoder (see above), Visual C++ 7.1 -Ox -Ot -G7
4277ms  4998ms  4640ms

So on medium and large datasets the decoder of Bjoern is very interesting,
but the tiny case (just Bjoern's name) is quite a bit slower. The other
cases seem more typical of what the average use in Python would be.

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Nobilitas sola est atque unica virtus...


Re: [Python-Dev] UTF-8 Decoder

2009-04-27 Thread Antoine Pitrou
Jeroen Ruigrok van der Werven writes:
> 
> So on medium and large datasets the decoder of Bjoern is very interesting,
> but the tiny case (just Bjoern's name) is quite a bit slower. The other
> cases seem more typical of what the average use in Python would be.

Keep in mind what the datasets are:

« The large buffer is a April 2009 Hindi Wikipedia article XML dump, the medium
buffer Markus Kuhn's UTF-8-demo.txt, and the tiny buffer my name »

It would be interesting to test with mostly ASCII data to see what that gives.
Now the good thing is that, even with wildly non-ASCII data, our current decoder
is very efficient.
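
A quick way to try the "mostly ASCII" comparison with the interpreter's own decoder is sketched below. The buffers are stand-ins (short samples repeated many times), not the datasets used in the benchmark above, so the numbers are only indicative.

```python
import timeit


def seconds_per_mib(data, number=20):
    """Time data.decode('utf-8') and normalise to seconds per MiB of input."""
    total = timeit.timeit(lambda: data.decode("utf-8"), number=number)
    return total / number / (len(data) / 2.0 ** 20)


# Assumed sample data: repeated sentences, not real corpora.
ascii_buf = ("The quick brown fox jumps over the lazy dog. " * 20000).encode("utf-8")
hindi_buf = ("हिन्दी विकिपीडिया का एक लेख। " * 20000).encode("utf-8")

for label, buf in (("mostly ASCII", ascii_buf), ("Devanagari", hindi_buf)):
    print("%12s: %.6f s/MiB" % (label, seconds_per_mib(buf)))
```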

Regards

Antoine.




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
>> It's a private use area. It will never carry an official character
>> assignment.
> 
> 
> I know that U+F - U+F is a private use area.  I don't find a
> definition of U+F01xx to know what the notation means.  Are you picking
> a particular character within the private use area, or a particular
> range, or what?

It's a range. The lower-case 'x' denotes a variable half-byte, ranging
from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code
points.
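
The notation can be spelled out in a few lines. This sketch only illustrates the U+F01xx range described above, one code point per possible byte value. The helper names are invented, and it is not the PEP's implementation.

```python
PUA_BASE = 0xF0100  # first code point of the U+F01xx range


def byte_to_pua(byte_value):
    """Map a byte value 0..255 to a code point in U+F0100..U+F01FF."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("byte value out of range")
    return chr(PUA_BASE + byte_value)


def pua_to_byte(ch):
    """Inverse mapping; rejects characters outside the range."""
    offset = ord(ch) - PUA_BASE
    if not 0 <= offset <= 0xFF:
        raise ValueError("character is not in U+F01xx")
    return offset


code_points = [byte_to_pua(b) for b in range(256)]
print(len(code_points), hex(ord(code_points[0])), hex(ord(code_points[-1])))
# prints: 256 0xf0100 0xf01ff
```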

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
>>> There are still issues regarding how Windows and POSIX programs that
>>> are  sharing cross-mounted file systems might communicate file names
>>> between  each other, which is not at all clear from the PEP.  If this
>>> is an  insoluble or un-addressed issue, it should be stated.  (It is
>>> probably  insoluble, due to there being multiple ways that the
>>> cross-mounted file  systems might translate names; but if there are,
>>> can we learn something  from the rules the mounting systems use, to
>>> be compatible with (one of)  them, or not.
>>> 
>>
>> I'd say that's out of scope. A windows filesystem mounted on a UNIX host
>> should probably be mounted with a mapping to translate the Windows
>> Unicode names into whatever the sysadmin deems the locally most apt
>> byte encoding. But sys.getfilesystemencoding() is based on the current
>> user's locale settings, which need not be the same.
>>   
> 
> And if it were, what would it do with files that can't be encoded with
> the locally most apt byte encoding? 

As Cameron says: it's out of the scope of the PEP. It really depends how
the operating system deals with them. Most likely, the files are not
accessible - not only not from Python, but also not accessible from
any other Unix program. Details depend on the specific operating system
software being used, and the specific parameters passed to it.

> That's where we might learn
> something about what behaviors are deemed acceptable.  Would such files
> be inaccessible?  Accessible with mangled names?  or what?

Difficult to tell. What operating system did you use, and what mount
options did you pass?

> And for a Unix filesystem mounted on a Windows host?  Or accessed via
> some network connection?

Same issue really: what specific mounting software did you use? Windows
cannot mount Unix file systems on its own, or through some network
connection.

Regards,
Martin


Re: [Python-Dev] 2.6.2 Vista installer failure on upgrade from 2.6.1

2009-04-27 Thread Martin v. Löwis
Jim Kleckner wrote:
> I went to upgrade a Vista machine from 2.6.1 to 2.6.2 and got error 2755
> with the message "system cannot open the device or file".
> 
> I uninstalled 2.6.1, removing all residual files also, and got the error
> message again.
> 
> When I ran msiexec as follows to get a log, it magically worked:
>  msiexec /i python-2.6.2.msi  /l*v install.log
> 
> Should I attempt to explore this further or just be happy?

Were you by any chance using a SUBSTed drive? If so, just be happy:
this is a known limitation (of Windows installer).

Otherwise, if you can contribute a useful bug report (or even a patch),
please go ahead. I would try to turn logging on through the registry and
see whether that gives any insight.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 00:07, Glenn Linderman  wrote:
> On approximately 4/25/2009 5:22 AM, came the following characters from  
> the keyboard of Martin v. Löwis:
>>> The problem with this, and other preceding schemes that have been
>>> discussed here, is that there is no means of ascertaining whether a
>>> particular file name str was obtained from a str API, or was funny-
>>> decoded from a bytes API... and thus, there is no means of reliably
>>> ascertaining whether a particular filename str should be passed to a
>>> str API, or funny-encoded back to bytes.
>>
>> Why is it necessary that you are able to make this distinction?
>
>
> It is necessary that programs (not me) can make the distinction, so that  
> it knows whether or not to do the funny-encoding or not.

I would say this isn't so. It's important that programs know if they're
dealing with strings-for-filenames, but not that they be able to figure
that out "a priori" if handed a bare string (especially since they
can't:-)

> If a name is  
> funny-decoded when the name is accessed by a directory listing, it needs  
> to be funny-encoded in order to open the file.

Hmm. I had thought that legitimate unicode strings already get transcoded
to bytes via the mapping specified by sys.getfilesystemencoding()
(the user's locale). That already happens I believe, and Martin's
scheme doesn't change this. He's just funny-encoding non-decodable byte
sequences, not the decoded stuff that surrounds them.

So it is already the case that strings get encoded to bytes by
calls like open(). Martin isn't changing that.

I suppose if your program carefully constructs a unicode string riddled
with half-surrogates etc and imagines something specific should happen
to them on the way to being POSIX bytes then you might have a problem...

I think the advantage to Martin's choice of encoding-for-undecodable-bytes
is that it _doesn't_ use normal characters for the special bits. This
means that _all_ normal characters are left unmangled in both "bare"
and "funny-encoded" strings.

Because of that, I now think I'm -1 on your "use printable characters
for the encoding". I think presentation of the special characters
_should_ look bogus in an app (eg little rectangles or whatever in a
GUI); it's a fine flashing red light to the user.

Also, by avoiding reuse of legitimate characters in the encoding we can
avoid your issue with losing track of where a string came from;
legitimate characters are currently untouched by Martin's scheme, except
for the normal "bytes<->string via the user's locale" translation that
must already happen, and there you're aided by bytes and strings being
different types.
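
For readers on later Python versions: the scheme that PEP 383 eventually standardised is available as the "surrogateescape" error handler (Python 3.1 and later). It uses lone surrogates U+DC80..U+DCFF rather than the private-use range discussed in this thread. A minimal sketch of the property described above, that decodable characters pass through untouched and only the undecodable bytes are escaped:

```python
# Valid UTF-8 for "café-", followed by a stray 0xFF byte.
raw = b"caf\xc3\xa9-\xff.txt"

name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))  # 'caf\xe9-\udcff.txt'

# Ordinary characters are left alone ...
assert name.startswith("café-")
# ... and the original bytes come back unchanged.
assert name.encode("utf-8", "surrogateescape") == raw
```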

> I'm certainly not experienced enough in Python development processes or  
> internals to attempt such, as yet.  But somewhere in 25 years of  
> programming, I picked up the knowledge that if you want to have a 1-to-1  
> reversible mapping, you have to avoid data puns, mappings of two  
> different data values into a single data value.  Your PEP, as first  
> written, didn't seem to do that... since there are two interfaces from  
> which to obtain data values, one performing a mapping from bytes to  
> "funny invalid" Unicode, and the other performing no mapping, but  
> accepting any sort of Unicode, possibly including "funny invalid"  
> Unicode, the possibility of data puns seems to exist.  I may be  
> misunderstanding something about the use cases that prevent these two  
> sources of "funny invalid" Unicode from ever coexisting, but if so,  
> perhaps you could point it out, or clarify the PEP.

Please elucidate the "second source" of strings. I'm presuming you mean
strings generated from scratch rather than obtained by something like
listdir().

Given such a string with "funny invalid" stuff in it, and _absent_
Martin's scheme, what do you expect the source of the strings to _expect_
to happen to them if passed to open()? They still have to be converted
to bytes at the POSIX layer anyway.
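
The question can be tried out with the "surrogateescape" handler that later shipped with PEP 383 (Python 3.1+). This is a sketch using UTF-8 as an assumed filesystem encoding; the string built from scratch is indistinguishable from the escaped one, and a strict encode refuses it.

```python
funny_decoded = b"name\xff".decode("utf-8", "surrogateescape")
hand_made = "name" + "\udcff"  # built from scratch, never came from bytes

# The two sources yield the same str, so the origin cannot be recovered.
print(funny_decoded == hand_made)

# Without the escape scheme, a strict encode refuses the lone surrogate.
try:
    hand_made.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# With the scheme, both strings map to the same bytes.
print(hand_made.encode("utf-8", "surrogateescape"))
```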

Cheers,
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

Heaven could change from chocolate to vanilla without violating perfection.
- arrom...@jyusenkyou.cs.jhu.edu (Ken Arromdee)


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Simon Cross
On Mon, Apr 27, 2009 at 9:48 PM, "Martin v. Löwis"  wrote:
> As Cameron says: it's out of the scope of the PEP. It really depends how
> the operating system deals with them. Most likely, the files are not
> accessible - not only not from Python, but also not accessible from
> any other Unix program. Details depend on the specific operating system
> software being used, and the specific parameters passed to it.

$ touch $'\xFF\xAA\xFF'
$ vi $'\xFF\xAA\xFF'
$ egrep foo $'\xFF\xAA\xFF'

All worked fine from my Bash shell with locale encoding set to UTF-8.
I can also open the created file from the GNOME editor file dialog (it
even tells me the filename is not valid in my locale's encoding). The
Nedit editor also worked. So far I haven't found anything that failed.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
> $ touch $'\xFF\xAA\xFF'
> $ vi $'\xFF\xAA\xFF'
> $ egrep foo $'\xFF\xAA\xFF'
> 
> All worked fine from my Bash shell with locale encoding set to UTF-8.
> I can also open the created file from the GNOME editor file dialog (it
> even tells me the filename is not valid in my locale's encoding). The
> Nedit editor also worked. So far I haven't found anything that failed.

So what SMB server did you mount here, using what software, and what
mount options?

I think you might be referring to an entirely different use case.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System C haracter Interfaces

2009-04-27 Thread Antoine Pitrou
Simon Cross writes:
> 
> $ touch $'\xFF\xAA\xFF'
> $ vi $'\xFF\xAA\xFF'
> $ egrep foo $'\xFF\xAA\xFF'
> 
> All worked fine from my Bash shell with locale encoding set to UTF-8.

The PEP is precisely about making py3k able to better handle these files (right
now os.listdir() doesn't return the offending file in its list of results).
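
On Python versions that implement the PEP (3.1 and later), the same file name can be exercised from Python. The sketch below assumes a POSIX filesystem that accepts arbitrary bytes in names (for example ext4 or tmpfs on Linux); the helper name list_roundtrip is invented for illustration.

```python
import os
import tempfile

raw_name = b"\xff\xaa\xff"  # the same bytes as in the shell example


def list_roundtrip(raw):
    """Create a file named `raw` in a temp dir; return (str names, bytes names)."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(os.fsencode(tmp), raw), "wb"):
            pass
        str_names = os.listdir(tmp)                 # str in, str out
        bytes_names = os.listdir(os.fsencode(tmp))  # bytes in, bytes out
        # The str name can be handed straight back to open().
        with open(os.path.join(tmp, str_names[0]), "rb"):
            pass
        return str_names, bytes_names


str_names, bytes_names = list_roundtrip(raw_name)
print([ascii(n) for n in str_names], bytes_names)
```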

Regards

Antoine.




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System C haracter Interfaces

2009-04-27 Thread Michael Foord

Stephen J. Turnbull wrote:
> Antoine Pitrou writes:
>
>  > > or (better for 2.x, where bytes are strings as far as most
>  > > programmers are concerned) as a new data type,
>  >
>  > I'm -1 on any new string-like type (for file paths or whatever
>  > else) with custom encoding/decoding semantics. It's the best way to
>  > ruin the clean str/bytes separation that 3.x introduced.
>
> Excuse me, but I can't see a scheme that encodes bytes as Unicodes but
> only sometimes as a "clean separation".  It's a dirty hack that makes
> life a lot easier for Windows programmers and a little easier for many
> Unix programmers.  Practicality beats purity, true, but at the cost of
> the purity.

The problem you don't address, which is still the reality for most 
programmers (especially Mac OS X where filesystem encoding is UTF 8), is 
that programmers *are* going to treat filenames as strings.


The proposed PEP allows that to work for them - whatever platform their 
program runs on.


Michael


>  > Besides, the goal is also to make things easier for the
>  > programmer. Otherwise, we'll have the same situation as in 2.x
>  > where many English-centric programmers produced code that was
>  > incapable of dealing with non-ASCII input, because they didn't care
>  > about the distinction between str and unicode.
>
> So what you'll get here, AFAICS, is a new situation where many
> Windows-centric programmers will produce code that's incapable of
> dealing with non-Unicode input because they don't have to care about
> the distinction between Unicode and bytes.
>
> That's an improvement, but we can do still better and not at huge
> expense to programmers.

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 12:42 PM, came the following characters from 
the keyboard of Martin v. Löwis:

>>> It's a private use area. It will never carry an official character
>>> assignment.
>>
>> I know that U+F - U+F is a private use area.  I don't find a
>> definition of U+F01xx to know what the notation means.  Are you picking
>> a particular character within the private use area, or a particular
>> range, or what?
>
> It's a range. The lower-case 'x' denotes a variable half-byte, ranging
> from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code
> points.



So you only need 128 code points, so there is something else unclear.
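
One reading of the "128 code points" remark: bytes below 0x80 are ASCII and decode normally under ASCII-compatible encodings, so only 0x80..0xFF can ever need escaping. The "surrogateescape" handler that later shipped with PEP 383 (Python 3.1+) behaves that way. The sketch below counts the escaped values, using ASCII as the assumed codec.

```python
# Decode every possible single byte with the escape handler.
decoded = {b: bytes([b]).decode("ascii", "surrogateescape") for b in range(256)}

# Bytes that came out as themselves versus bytes that needed an escape.
plain = [b for b, ch in decoded.items() if ord(ch) == b]
escaped = [b for b, ch in decoded.items() if 0xDC80 <= ord(ch) <= 0xDCFF]

print(len(plain), len(escaped))  # 128 128
print(hex(min(escaped)), hex(max(escaped)))
```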


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 12:48 PM, came the following characters from 
the keyboard of Martin v. Löwis:

>>>> There are still issues regarding how Windows and POSIX programs that
>>>> are sharing cross-mounted file systems might communicate file names
>>>> between each other, which is not at all clear from the PEP.  If this
>>>> is an insoluble or un-addressed issue, it should be stated.  (It is
>>>> probably insoluble, due to there being multiple ways that the
>>>> cross-mounted file systems might translate names; but if there are,
>>>> can we learn something from the rules the mounting systems use, to
>>>> be compatible with (one of) them, or not.
>>>
>>> I'd say that's out of scope. A windows filesystem mounted on a UNIX host
>>> should probably be mounted with a mapping to translate the Windows
>>> Unicode names into whatever the sysadmin deems the locally most apt
>>> byte encoding. But sys.getfilesystemencoding() is based on the current
>>> user's locale settings, which need not be the same.
>>
>> And if it were, what would it do with files that can't be encoded with
>> the locally most apt byte encoding?
>
> As Cameron says: it's out of the scope of the PEP. It really depends how
> the operating system deals with them. Most likely, the files are not
> accessible - not only not from Python, but also not accessible from
> any other Unix program. Details depend on the specific operating system
> software being used, and the specific parameters passed to it.
  



I'm not suggesting the PEP should solve the problem of mounting foreign 
file systems, although if it doesn't it should probably point that out.  
I'm just suggesting that if the people that write software to solve the 
problem of mounting foreign file systems have already solved the naming 
problem, then it might be a source of a good solution.  On the other 
hand, it might be the source of a mediocre or bad solution.  However, if 
those mounting system have good solutions, it would be good to be 
compatible with them, rather than have yet another solution.  It was in 
that sense, of thinking about possibly existing practice, and leveraging 
an existing solution, that caused me to bring up the topic.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Steven D'Aprano
On Tue, 28 Apr 2009 04:13:47 am Antoine Pitrou wrote:
> Stephen J. Turnbull writes:
...
> > So what you'll get here, AFAICS, is a new situation where many
> > Windows-centric programmers will produce code that's incapable of
> > dealing with non-Unicode input because they don't have to care
> > about the distinction between Unicode and bytes.
>
> I don't understand what you're saying. py3k filenames are all
> unicode, even on POSIX systems, 


How is that possible on POSIX systems where the underlying file system 
uses bytes for filenames?

If I write a piece of Python code:

filename = 'some path/some name'

I might call it a filename, I might think of it as a filename, but it 
*isn't*, it's a string in a Python program. It isn't a filename until 
it hits the file system, and in POSIX systems that makes it bytes.



-- 
Steven D'Aprano


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 23:27, Simon Cross  wrote:
| On Mon, Apr 27, 2009 at 9:48 PM, "Martin v. Löwis"  wrote:
| > As Cameron says: it's out of the scope of the PEP. It really depends how
| > the operating system deals with them. Most likely, the files are not
| > accessible - not only not from Python, but also not accessible from
| > any other Unix program. Details depend on the specific operating system
| > software being used, and the specific parameters passed to it.
| 
| $ touch $'\xFF\xAA\xFF'
| $ vi $'\xFF\xAA\xFF'
| $ egrep foo $'\xFF\xAA\xFF'
| 
| All worked fine from my Bash shell with locale encoding set to UTF-8.
| I can also open the created file from the GNOME editor file dialog (it
| even tells me the filename is not valid in my locale's encoding). The
| Nedit editor also worked. So far I haven't found anything that failed.

Yes, they would. Are you doing that on a real UNIX filesystem
(ext2/3/4, XFS etc)?

I'm not sure whether you're arguing for or against the proposal here,
btw.

This would make a file with a presumably UTF-8-invalid name. Martin's
proposal would cheerfully map that losslessly to a string. Is there a
problem here?
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

Stepwise Refinement n.  A sequence of kludges K, neither distinct or finite,
applied to a program P aimed at transforming it into the target program Q.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Benjamin Peterson
2009/4/27 Cameron Simpson :
> I think that, almost independent of this PEP, there should be an
> os.fsencode() function that takes a byte string (as a POSIX OS call
> will take) and performs the _same_ byte->string encoding that listdir()
> and friends are doing under the hood. And a partner os.fsdecode() for
> string->bytes. That will save a lot of wheel respoking and probably make
> it easier for people to think about this.

some_path.encode(sys.getfilesystemencoding())
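
One caveat, sketched under the assumption that the PEP's error handler
(`surrogateescape` in released Python 3) is in use: a plain `encode()`
call knows nothing about escaped bytes, so the handler has to be named
explicitly. The file name and the `utf-8` stand-in for
`sys.getfilesystemencoding()` are made up for the example.

```python
encoding = "utf-8"  # stand-in for sys.getfilesystemencoding()

# A name as listdir() might return it: one undecodable byte (0x80) escaped.
some_path = b"log\x80.txt".decode(encoding, "surrogateescape")

try:
    some_path.encode(encoding)  # strict mode refuses the lone surrogate
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Naming the handler recovers the original bytes.
raw = some_path.encode(encoding, "surrogateescape")
```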



-- 
Regards,
Benjamin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 2:14 PM, came the following characters from 
the keyboard of Cameron Simpson:

On 27Apr2009 00:07, Glenn Linderman  wrote:
  
On approximately 4/25/2009 5:22 AM, came the following characters from  
the keyboard of Martin v. Löwis:


The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.


Why is it necessary that you are able to make this distinction?
  
It is necessary that programs (not me) can make the distinction, so that  
it knows whether or not to do the funny-encoding or not.



I would say this isn't so. It's important that programs know if they're
dealing with strings-for-filenames, but not that they be able to figure
that out "a priori" if handed a bare string (especially since they
can't:-)
  


So you agree they can't... that there are data puns.   (OK, you may not 
have thought that through)



If a name is  
funny-decoded when the name is accessed by a directory listing, it needs  
to be funny-encoded in order to open the file.



Hmm. I had thought that legitimate unicode strings already get transcoded
to bytes via the mapping specified by sys.getfilesystemencoding()
(the user's locale). That already happens I believe, and Martin's
scheme doesn't change this. He's just funny-encoding non-decodable byte
sequences, not the decoded stuff that surrounds them.
  


So assume a non-decodable sequence in a name.  That puts us into 
Martin's funny-decode scheme.  His funny-decode scheme produces a bare 
string, indistinguishable from a bare string that would be produced by a 
str API that happens to contain that same sequence.  Data puns.


So when open is handed the string, should it open the file with the name 
that matches the string, or the file with the name that funny-decodes to 
the same string?  It can't know, unless it knows that the string is a 
funny-decoded string or not.



So it is already the case that strings get decoded to bytes by
calls like open(). Martin isn't changing that.
  


I thought the process of converting strings to bytes is called 
encoding.  You seem to be calling it decoding?




I suppose if your program carefully constructs a unicode string riddled
with half-surrogates etc and imagines something specific should happen
to them on the way to being POSIX bytes then you might have a problem...
  


Right.  Or someone else's program does that.  I only want to use Unicode 
file names.  But if those other file names exist, I want to be able to 
access them, and not accidentally get a different file.



I think the advantage to Martin's choice of encoding-for-undecodable-bytes
is that it _doesn't_ use normal characters for the special bits. This
means that _all_ normal characters are left unmangled in both "bare"
and "funny-encoded" strings.
  


Whether the characters used for funny decoding are normal or abnormal, 
unless they are prevented from also appearing in filenames when they are 
obtained from or passed to other APIs, there is the possibility that the 
funny-decoded name also exists in the filesystem by the funny-decoded 
name... a data pun on the name.


Whether the characters used for funny decoding are normal or abnormal, 
if they are not prevented from also appearing in filenames when they are 
obtained from or passed to other APIs, then in order to prevent data 
puns, *all* names must be passed through the decoder, and the decoder 
must perform a 1-to-1 reversible mapping.  Martin's funny-decode process 
does not perform a 1-to-1 reversible mapping (unless he's changed it 
from the version of the PEP I found to read).
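
Whether the mapping is reversible can be checked directly. The sketch
below uses the `surrogateescape` handler as it later shipped in Python 3,
with a UTF-8 file system encoding; both are assumptions, and the draft
discussed here may have behaved differently. In that implementation the
UTF-8 decoder treats an encoded lone surrogate as undecodable bytes, so
the two byte names below decode to different strings. The helper names
are hypothetical.

```python
def funny_decode(raw: bytes) -> str:
    """Hypothetical helper: bytes -> str, escaping undecodable bytes."""
    return raw.decode("utf-8", "surrogateescape")

def funny_encode(name: str) -> bytes:
    """Hypothetical helper: the inverse direction, str -> bytes."""
    return name.encode("utf-8", "surrogateescape")

undecodable = b"\xff"        # a byte that is never valid UTF-8
lookalike = b"\xed\xb3\xbf"  # what a lax encoder would emit for U+DCFF

# Collect the decoded forms; a pun would collapse them into one entry.
names = {funny_decode(undecodable), funny_decode(lookalike)}
```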


This is why some people have suggested using the null character for the 
decoding, because it and / can't appear in POSIX file names, but 
everything else can.  But that makes it really hard to display the 
funny-decoded characters.




Because of that, I now think I'm -1 on your "use printable characters
for the encoding". I think presentation of the special characters
_should_ look bogus in an app (eg little rectangles or whatever in a
GUI); it's a fine flashing red light to the user.
  


The reason I picked an ASCII printable character is just to make it 
easier for humans to see the encoding.  The scheme would also work with 
a non-ASCII non-printable character... but I fail to see how that would 
help a human compare the strings on a display of file names.  Having a 
bunch of abnormal characters in a row, displayed using a single 
replacement glyph, just makes an annoying mess in the file open dialog.



Also, by avoiding reuse of legitimate characters in the encoding we can
avoid your issue with losing track of where a string came from;
legitimate characters are cur

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 21:48, Martin v. Löwis  wrote:
| >>> There are still issues regarding how Windows and POSIX programs that
| >>> are  sharing cross-mounted file systems might communicate file names
| >>> between  each other, which is not at all clear from the PEP.  If this
| >>> is an  insoluble or un-addressed issue, it should be stated.  (It is
| >>> probably  insoluble, due to there being multiple ways that the
| >>> cross-mounted file  systems might translate names; but if there are,
| >>> can we learn something  from the rules the mounting systems use, to
| >>> be compatible with (one of)  them, or not.
| >>
| >> I'd say that's out of scope. A windows filesystem mounted on a UNIX host
| >> should probably be mounted with a mapping to translate the Windows
| >> Unicode names into whatever the sysadmin deems the locally most apt
| >> byte encoding. But sys.getfilesystemencoding() is based on the current
| >> user's locale settings, which need not be the same.
| >>   
| > 
| > And if it were, what would it do with files that can't be encoded with
| > the locally most apt byte encoding? 
| 
| As Cameron says: it's out of the scope of the PEP. It really depends how
| the operating system deals with them. Most likely, the files are not
| accessible - not only not from Python, but also not accessible from
| any other Unix program.

Well... If the files exist and the encoding of the mount software
permits, there will be a sequence of bytes for the filename, and it
will be accessible to a pure UNIX byte-speaking program. It will also
be accessible from Python, because the os.* calls convert both ways:
bytes->string and string->bytes as required. Martin's PEP just makes that
lossless, which currently it is not.

Conversely, if the mount software refuses to map the filename to a POSIX
byte string, the file won't exist, or will refuse to be created. For a
concrete example we have but to observe my macify program I was trying
to counter the PEP with (I'm now a convert, btw). It is to run on a real
UNIX system and recode filenames into UTF-8 NFD, _prior_ to rsyncing
to a Mac. Why? Because the MacOSX HFS filesystem refuses to accept byte
strings not parsable by that encoding, and my music rsyncs were exploding,
refusing to create files on the target Mac.
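
The recoding step can be sketched as follows. This is not the macify
program itself, only an illustration: `to_utf8_nfd` and the Latin-1
source encoding are assumptions, and HFS applies its own variant of
decomposition that plain NFD may not match in every case.

```python
import unicodedata

def to_utf8_nfd(raw_name: bytes, source_encoding: str) -> bytes:
    """Decode a file name, decompose it (NFD), and re-encode it as UTF-8."""
    text = raw_name.decode(source_encoding)
    return unicodedata.normalize("NFD", text).encode("utf-8")

# "cafe" with an acute accent, stored as Latin-1, becomes "cafe" followed
# by U+0301 (combining acute accent) in UTF-8.
recoded = to_utf8_nfd(b"caf\xe9", "latin-1")
```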

And there's probably some grey area where a dodgy mount software will present
names that can't be used.

There's a supposed counter example in another followup post which I'll
address there, since it seemed a little bogus to me.

I think that, almost independent of this PEP, there should be an
os.fsencode() function that takes a byte string (as a POSIX OS call
will take) and performs the _same_ byte->string encoding that listdir()
and friends are doing under the hood. And a partner os.fsdecode() for
string->bytes. That will save a lot of wheel respoking and probably make
it easier for people to think about this.

Aside: thinking on that, perhaps those functions should be in posix.*,
or alternatively would a Windows system offer them in os.* to produce
native UTF-16 byte strings; useless for the Windows API which cleanly
takes unicode (I gather) but perhaps handy for people hacking filesystems
directly or something like that.  (Except I gather from a former existence
that there is a multitude of on-disk filename encodings under Windows
depending how old your filesystems are and if they're FAT or NTFS, etc).

Cheers,
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

Your eyes are weary from staring at the CRT.  You feel sleepy.  Notice how
restful it is to watch the cursor blink.  Close your eyes.  The opinions
stated above are yours.  You cannot imagine why you ever felt otherwise.
- gabri...@tplrd.tpl.oz.au


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 5:42 PM, came the following characters from 
the keyboard of Cameron Simpson:

I think that, almost independent of this PEP, there should be an
os.fsencode() function that takes a byte string (as a POSIX OS call
will take) and performs the _same_ byte->string encoding that listdir()
and friends are doing under the hood. And a partner os.fsdecode() for
string->bytes. That will save a lot of wheel respoking and probably make
it easier for people to think about this.
  


If a generally useful encoding scheme is invented for transforming file 
names within Python, it should definitely be made available for those 
cases where the application must transform between an encoded Python 
name and either a str or bytes interface presented by 3rd party software.


It should be available on all platforms, so that portable code can be 
written.  Of course, if there are variations in the 3rd party software 
on the various platforms, there still may be a need for 
platform-specific code.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 18:15, Glenn Linderman  wrote:
> The problem with this, and other preceding schemes that have been
> discussed here, is that there is no means of ascertaining whether a
> particular file name str was obtained from a str API, or was funny-
> decoded from a bytes API... and thus, there is no means of reliably
> ascertaining whether a particular filename str should be passed to a
> str API, or funny-encoded back to bytes.
> 
>>>> Why is it necessary that you are able to make this distinction?
>>>>
>>> It is necessary that programs (not me) can make the distinction, so 
>>> that  it knows whether or not to do the funny-encoding or not.
>>> 
>>
>> I would say this isn't so. It's important that programs know if they're
>> dealing with strings-for-filenames, but not that they be able to figure
>> that out "a priori" if handed a bare string (especially since they
>> can't:-)
>
> So you agree they can't... that there are data puns.   (OK, you may not  
> have thought that through)

I agree you can't examine a string and know if it came from the os.* munging
or from someone else's munging.

I totally disagree that this is a problem.

There may be puns. So what? Use the right strings for the right purpose
and all will be well.

I think what is missing here, and missing from Martin's PEP, is some
utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

  os.fsdecode(bytes) -> funny-encoded Unicode
This is what os.listdir() does to produce the strings it hands out.
  os.fsencode(funny-string) -> bytes
This is what open(filename,..) does to turn the filename into bytes
for the POSIX open.
  os.pathencode(your-string) -> funny-encoded-Unicode
This is what you must do to a de novo string to turn it into a
string suitable for use by open.
Importantly, for most strings not hand crafted to have weird
sequences in them, it is a no-op. But it will recode your puns
for survival.
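
A rough sketch of the first two functions, written against an explicit
codec so the behaviour does not depend on the locale. The names
`fs_decode`, `fs_encode` and `has_escaped_bytes` are hypothetical; the
handler name `surrogateescape` is the one later Python 3 releases use,
where `os.fsdecode()` and `os.fsencode()` play these roles. Nothing here
implements os.pathencode(); the last helper only reports whether a
string carries escaped bytes.

```python
def fs_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """bytes -> funny-encoded str, as os.listdir() would hand it out."""
    return raw.decode(encoding, "surrogateescape")

def fs_encode(name: str, encoding: str = "utf-8") -> bytes:
    """funny-encoded str -> bytes, as needed for the POSIX open()."""
    return name.encode(encoding, "surrogateescape")

def has_escaped_bytes(name: str) -> bool:
    """True if the string contains lone surrogates U+DC80..U+DCFF."""
    return any("\udc80" <= ch <= "\udcff" for ch in name)

# A name with one byte (0xE9) that is not valid UTF-8 in this position.
listed = fs_decode(b"track\xe9.mp3")
```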

and for me, I would like to see:

  os.setfilesystemencoding(coding)

Currently os.getfilesystemencoding() returns you the encoding based on
the current locale, and (I trust) the os.* stuff encodes on that basis.
setfilesystemencoding() would override that, unless coding==None in which
case it reverts to the former "use the user's current locale" behaviour.
(We have locale "C" for what one might otherwise expect None to mean:-)

The idea here is to let the program control the codec used for filenames
for special purposes, without working indirectly through the locale.

>>> If a name is  funny-decoded when the name is accessed by a directory 
>>> listing, it needs  to be funny-encoded in order to open the file.
>>
>> Hmm. I had thought that legitimate unicode strings already get transcoded
>> to bytes via the mapping specified by sys.getfilesystemencoding()
>> (the user's locale). That already happens I believe, and Martin's
>> scheme doesn't change this. He's just funny-encoding non-decodable byte
>> sequences, not the decoded stuff that surrounds them.
>
> So assume a non-decodable sequence in a name.  That puts us into  
> Martin's funny-decode scheme.  His funny-decode scheme produces a bare  
> string, indistinguishable from a bare string that would be produced by a  
> str API that happens to contain that same sequence.  Data puns.

See my proposal above. Does it address your concerns? A program still
must know the provenance of the string, and _if_ you're working with
non-decodable sequences in names then you should transmute them into
the funny encoding using the os.pathencode() function described above.

In this way the punning issue can be avoided.

_Lacking_ such a function, your punning concern is valid.

> So when open is handed the string, should it open the file with the name  
> that matches the string, or the file with the name that funny-decodes to  
> the same string?  It can't know, unless it knows that the string is a  
> funny-decoded string or not.

True. open() should always expect a funny-encoded name.

>> So it is already the case that strings get decoded to bytes by
>> calls like open(). Martin isn't changing that.
>
> I thought the process of converting strings to bytes is called encoding.  
> You seem to be calling it decoding?

My head must be standing in the wrong place. Yes, I probably mean
encoding here. I'm trying to accompany these terms with little pictures
like "string->bytes" to avoid confusion.

>> I suppose if your program carefully constructs a unicode string riddled
>> with half-surrogates etc and imagines something specific should happen
>> to them on the way to being POSIX bytes then you might have a problem...
>
> Right.  Or someone else's program does that.  I only want to use Unicode  
> file names.  But if those other file names exist, I want to be able to  
> access them, and not accidentally get a different file.

Point taken. And I think addressed by the utility function propo

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Benjamin Peterson
2009/4/27 Cameron Simpson :
>
> PROPOSAL: add to the PEP the following functions:
>
>  os.fsdecode(bytes) -> funny-encoded Unicode
>    This is what os.listdir() does to produce the strings it hands out.
>  os.fsencode(funny-string) -> bytes
>    This is what open(filename,..) does to turn the filename into bytes
>    for the POSIX open.
>  os.pathencode(your-string) -> funny-encoded-Unicode
>    This is what you must do to a de novo string to turn it into a
>    string suitable for use by open.
>    Importantly, for most strings not hand crafted to have weird
>    sequences in them, it is a no-op. But it will recode your puns
>    for survival.
>
> and for me, I would like to see:
>
>  os.setfilesystemencoding(coding)
>
> Currently os.getfilesystemencoding() returns you the encoding based on
> the current locale, and (I trust) the os.* stuff encodes on that basis.
> setfilesystemencoding() would override that, unless coding==None in what
> case it reverts to the former "use the user's current locale" behaviour.
> (We have locale "C" for what one might otherwise expect None to mean:-)

Time machine! 
http://docs.python.org/dev/py3k/library/sys.html#sys.setfilesystemencoding



-- 
Regards,
Benjamin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
Glenn Linderman wrote:
> On approximately 4/27/2009 12:42 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>>> It's a private use area. It will never carry an official character
>>>> assignment.
>>>
>>> I know that U+F - U+F is a private use area.  I don't find a
>>> definition of U+F01xx to know what the notation means.  Are you picking
>>> a particular character within the private use area, or a particular
>>> range, or what?
>>
>> It's a range. The lower-case 'x' denotes a variable half-byte, ranging
>> from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code
>> points.
> 
> 
> So you only need 128 code points, so there is something else unclear.

(please understand that this is history now, since the PEP has stopped
using PUA characters).

No. You seem to assume that all bytes < 128 decode successfully always.
I believe this assumption is wrong, in general:

py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position
3-4: illegal multibyte sequence

All bytes are below 128, yet it fails to decode.
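
The same check in Python 3 syntax, as a small sketch; the variable names
are illustrative only.

```python
# Same bytes as in the 2.x session, written as a Python 3 bytes literal.
data = b"\x1b$B' \x1b(B"

# Every byte is in the 7-bit range.
all_seven_bit = all(byte < 128 for byte in data)

try:
    data.decode("iso-2022-jp")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```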

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
> I'm not suggesting the PEP should solve the problem of mounting foreign
> file systems, although if it doesn't it should probably point that out. 
> I'm just suggesting that if the people that write software to solve the
> problem of mounting foreign file systems have already solved the naming
> problem, then it might be a source of a good solution.  On the other
> hand, it might be the source of a mediocre or bad solution.  However, if
> those mounting systems have good solutions, it would be good to be
> compatible with them, rather than have yet another solution.  It was in
> that sense, of thinking about possibly existing practice, and leveraging
> an existing solution, that caused me to bring up the topic.

I think you make quite a lot of assumptions here. It would be better
to research the state of the art first, and only then propose to follow it.

Regards,
Martin



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 21:58, Benjamin Peterson  wrote:
| 2009/4/27 Cameron Simpson :
| > PROPOSAL: add to the PEP the following functions:
[...]
| > and for me, I would like to see:
| >  os.setfilesystemencoding(coding)
| >
| > Currently os.getfilesystemencoding() returns you the encoding based on
| > the current locale, and (I trust) the os.* stuff encodes on that basis.
| > setfilesystemencoding() would override that, unless coding==None in what
| > case it reverts to the former "use the user's current locale" behaviour.
| > (We have locale "C" for what one might otherwise expect None to mean:-)
| 
| Time machine! 
http://docs.python.org/dev/py3k/library/sys.html#sys.setfilesystemencoding

How embarrassing. I thought I'd looked.

It doesn't have the None->return-to-default mode, and I'd like to see
the word "overwritten" replaced by "overridden".

And of course if Martin's PEP gets adopted then the "e.g." clause needs
replacing:-)
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

Do not taunt Happy Fun Coder.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
>> I don't understand what you're saying. py3k filenames are all
>> unicode, even on POSIX systems, 
> 
> 
> How is that possible on POSIX systems where the underlying file system 
> uses bytes for filenames?
> 
> If I write a piece of Python code:
> 
> filename = 'some path/some name'
> 
> I might call it a filename, I might think of it as a filename, but it 
> *isn't*, it's a string in a Python program. It isn't a filename until 
> it hits the file system, and in POSIX systems that makes it bytes.

Python automatically encodes strings with the file system encoding
before passing them to the POSIX API.
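
In Python 3 releases after this thread (3.2 onward), that implicit step
is also exposed as `os.fsencode()`, with `os.fsdecode()` as its inverse.
A small sketch using the example string from above:

```python
import os

filename = "some path/some name"

# str -> bytes using the file system encoding, as done before a POSIX call.
raw = os.fsencode(filename)

# bytes -> str, the direction a directory listing takes.
back = os.fsdecode(raw)
```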

Regards,
Martin



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Michael Foord writes:

 > The problem you don't address, which is still the reality for most 
 > programmers (especially Mac OS X where filesystem encoding is UTF 8), is 
 > that programmers *are* going to treat filenames as strings.

 > The proposed PEP allows that to work for them - whatever platform their 
 > program runs on.

Sure, for values of "work" == "No exception will be raised in my
module, and some content will actually be returned."  It doesn't say
anything about what happens once those strings escape the immediate
context.  So it *encourages* those programmers to pass any problems
downstream, but only after discarding the resources needed to deal
with problems effectively.

It's not that hard to overcome that problem, but it does require a
slightly more complex API, and one that doesn't return a string but
rather a stringlike object annotated with the information about how it
was decoded.  Conversion to a string *should* be trivial; I just think
it should be invoked explicitly to make it clear where information is
being discarded.  Without an implicit conversion, the nature of the
data (ie, context-dependent structure) is made explicit.  There's a
natural place to document the problem that context must be used to
interpret the data accurately, and even add more robust processing (in
a new PEP, of course!), etc.
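
A minimal sketch of what such an annotated object could look like. The
class `DecodedName` and its members are hypothetical and not part of any
proposal in this thread; the point is only that conversion to a plain
string is an explicit call where the caller chooses how bad bytes are
handled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecodedName:
    """Hypothetical string-like wrapper that remembers how it was decoded."""
    raw: bytes      # bytes exactly as returned by the OS
    encoding: str   # codec assumed when interpreting them

    @property
    def lossless(self) -> bool:
        """True if the raw bytes decode cleanly under the assumed codec."""
        try:
            self.raw.decode(self.encoding)
            return True
        except UnicodeDecodeError:
            return False

    def to_str(self, errors: str = "replace") -> str:
        """Explicit conversion; information may be discarded here."""
        return self.raw.decode(self.encoding, errors)

# A Latin-1 style name interpreted under an assumed UTF-8 encoding.
name = DecodedName(b"r\xe9sum\xe9.txt", "utf-8")
```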

Then in the future this interface could be used as the basis of a more
robust API.  With good design (and luck) it might be subclassible or
extensible to a path object API, for example.  PEP 383 on the other
hand is a dead end as it stands.  AFAICS it gives the best possible
treatment of conversion of OS data to plain string, but we've already
got developers lining up to say "I can't use it". :-(



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Tony Nelson writes:
 > At 16:09 + 04/27/2009, Antoine Pitrou wrote:
 > >Stephen J. Turnbull  xemacs.org> writes:
 > >>
 > >> I hate to break it to you, but most stages of mail processing have
 > >> very little to do with SMTP.  In particular, processing MIME
 > >> attachments often requires dealing with file names.
 > >
 > >AFAIK, the file name is only there as an indication for the user
 > >when he wants to save the file. If it's garbled a bit, no big
 > >deal.

Nobody said we were at the stage of *saving* the file!



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread James Y Knight


On Apr 27, 2009, at 11:35 PM, Martin v. Löwis wrote:

No. You seem to assume that all bytes < 128 decode successfully  
always.

I believe this assumption is wrong, in general:

py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position
3-4: illegal multibyte sequence

All bytes are below 128, yet it fails to decode.


Surely nobody uses iso2022 as an LC_CTYPE encoding. That's expressly  
forbidden by POSIX, if I'm not mistaken...and I can't see how it would  
work, considering that it uses all the bytes from 0x20-0x7f, including  
0x2f ("/"), to represent non-ascii characters.


Hopefully it can be assumed that your locale encoding really is a non- 
overlapping superset of ASCII, as is required by POSIX...


I'm a bit scared at the prospect that U+DCAF could turn into "/", that  
just screams security vulnerability to me.  So I'd like to propose  
that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be  
encoded/decoded via the error handler.
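
For comparison, the `surrogateescape` handler in released Python 3
behaves along these lines. This describes the later implementation, not
the draft under discussion, and the variable names are illustrative.

```python
# Lone surrogates in U+DC80..U+DCFF map back to the bytes 0x80..0xFF.
high = "\udcaf".encode("utf-8", "surrogateescape")

# A surrogate outside that range is rejected instead of becoming "/" (0x2F).
try:
    "\udc2f".encode("utf-8", "surrogateescape")
    slash_smuggled = True
except UnicodeEncodeError:
    slash_smuggled = False
```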


James


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 8:35 PM, came the following characters from 
the keyboard of Martin v. Löwis:

Glenn Linderman wrote:

On approximately 4/27/2009 12:42 PM, came the following characters from
the keyboard of Martin v. Löwis:

It's a private use area. It will never carry an official character
assignment.

I know that U+F - U+F is a private use area.  I don't find a
definition of U+F01xx to know what the notation means.  Are you picking
a particular character within the private use area, or a particular
range, or what?

It's a range. The lower-case 'x' denotes a variable half-byte, ranging
from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code
points.


So you only need 128 code points, so there is something else unclear.


(please understand that this is history now, since the PEP has stopped
using PUA characters).



Yes, but having found the latest PEP finally (at least I hope the one at 
python.org is the latest, it has quit using PUA anyway), I confirm it is 
history.  But the same issue applies to the range of half-surrogates.




No. You seem to assume that all bytes < 128 decode successfully always.
I believe this assumption is wrong, in general:

py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position
3-4: illegal multibyte sequence

All bytes are below 128, yet it fails to decode.



Indeed, that was the missing piece.  I'd forgotten about the encodings 
that use escape sequences, rather than UTF-8, and DBCS.  I don't think 
those encodings are permitted by POSIX file systems, but I suppose they 
could sneak in via Environment variable values, and the like.


The switch from PUA to half-surrogates does not resolve the issues with 
the encoding not being a 1-to-1 mapping, though.  The very fact that you 
 think you can get away with use of lone surrogates means that other 
people might, accidentally or intentionally, also use lone surrogates 
for some other purpose.  Even in file names.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Robert Collins
On Mon, 2009-04-27 at 22:25 -0700, Glenn Linderman wrote:
> 
> Indeed, that was the missing piece.  I'd forgotten about the
> encodings 
> that use escape sequences, rather than UTF-8, and DBCS.  I don't
> think 
> those encodings are permitted by POSIX file systems, but I suppose
> they 
> could sneak in via Environment variable values, and the like.

This may already have been discussed, and if so I apologise for the
noise.

Does the PEP take into consideration the normalising behaviour of Mac
OS X? We've had some ongoing challenges in bzr related to this.

-Rob




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 8:39 PM, came the following characters from 
the keyboard of Martin v. Löwis:

I'm not suggesting the PEP should solve the problem of mounting foreign
file systems, although if it doesn't it should probably point that out. 
I'm just suggesting that if the people that write software to solve the
problem of mounting foreign file systems have already solved the naming
problem, then it might be a source of a good solution.  On the other
hand, it might be the source of a mediocre or bad solution.  However, if
those mounting systems have good solutions, it would be good to be
compatible with them, rather than have yet another solution.  It was in
that sense, of thinking about possibly existing practice, and leveraging
an existing solution, that caused me to bring up the topic.



I think you make quite a lot of assumptions here. It would be better
to research the state of the art first, and only then propose to follow it.


I didn't propose to follow it.  I only proposed an area that could be 
researched as a source of ideas and/or potential solutions.  Apparently 
there wasn't, but there could have been someone listening that had the 
results of such research on the tip of their tongue, and might have 
piped up with the techniques used.  I did, in fact, begin researching 
the topic after making the suggestion, and thus far haven't found any 
brilliant solutions from that arena.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking



[Python-Dev] PEP 383 (again)

2009-04-27 Thread Thomas Breuel
I thought PEP-383 was a fairly neat approach, but after thinking about it, I
now think that it is wrong.

PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode strings in
a reversible way.  But how do those non-UTF-8 byte sequences get into those
path names in the first place?  Most likely because an encoding other than
UTF-8 was used to write the file system, but you're now trying to interpret
its path names as UTF-8.

Quietly escaping a bad UTF-8 encoding with private Unicode characters is
unlikely to be the right thing, since using the wrong encoding likely means
that other characters are decoded incorrectly as well.   As a result, the
path name may fail in string comparisons and pattern matching, and will look
wrong to the user in print statements and dialog boxes. Therefore, when
Python encounters path names on a file system that are not consistent with
the (assumed) encoding for that file system, Python should raise an error.

If you really don't care what the string looks like and you just want an
encoding that round-trips without loss, you can probably just set your
encoding to one of the 8 bit encodings, like ISO 8859-15.   Decoding
arbitrary byte sequences to unicode strings as ISO 8859-15 is no less
correct than decoding them as the proposed "utf-8b".  In fact, the most
likely source of non-UTF-8 sequences is ISO 8859 encodings.
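
The trade-off described here can be sketched as follows. This assumes
the proposed "utf-8b" behaves like the "surrogateescape" error handler
that later shipped with Python 3.1; the file name used is purely
illustrative.

```python
# Every possible byte value, including sequences that are not valid UTF-8.
all_bytes = bytes(range(256))

# ISO 8859-15 assigns a character to every byte, so it always round-trips...
as_latin9 = all_bytes.decode("iso8859-15")
latin9_roundtrip = as_latin9.encode("iso8859-15") == all_bytes

# ...and so does UTF-8 combined with the "surrogateescape" error handler.
as_utf8b = all_bytes.decode("utf-8", "surrogateescape")
utf8b_roundtrip = as_utf8b.encode("utf-8", "surrogateescape") == all_bytes

# The two differ on names that really are UTF-8: ISO 8859-15 garbles
# them, while the escaping decoder leaves valid sequences alone.
name_bytes = "\u00e9.txt".encode("utf-8")                      # b'\xc3\xa9.txt'
seen_as_latin9 = name_bytes.decode("iso8859-15")               # 'Ã©.txt'
seen_as_utf8b = name_bytes.decode("utf-8", "surrogateescape")  # 'é.txt'
```

Both approaches are lossless; they differ only in how much of the name
stays readable when most names do match the assumed encoding.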

As for what the byte-oriented interfaces should do, they are simply platform
dependent.  On UNIX, they should do the obvious thing.  On Windows, they can
either hook up to the low-level byte-oriented system calls that the systems
supply, or Windows could fake it and have the byte-oriented interfaces use
UTF-8 encodings always and reject non-UTF-8 sequences as illegal (there are
already many illegal byte sequences anyway).

Tom


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
James Y Knight wrote:
> Hopefully it can be assumed that your locale encoding really is a
> non-overlapping superset of ASCII, as is required by POSIX...

Can you please point to the part of the POSIX spec that says that
such overlapping is forbidden?

> I'm a bit scared at the prospect that U+DCAF could turn into "/", that
> just screams security vulnerability to me.  So I'd like to propose that
> only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be
> encoded/decoded via the error handler.

It would actually be U+DC2F that would turn into /.
I'm happy to exclude that range from the mapping if POSIX really
requires an encoding not to be overlapping with ASCII.
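
For reference, a small sketch of the restricted mapping James proposes.
It relies on the "surrogateescape" error handler as later released in
Python 3.1, which only maps bytes 0x80-0xFF; that handler postdates this
message, so take it as an illustration of the idea rather than of the
draft under discussion.

```python
# Bytes 0x80-0xFF that cannot be decoded become U+DC80-U+DCFF and back.
escaped = b"\xff".decode("ascii", "surrogateescape")   # '\udcff'
restored = escaped.encode("ascii", "surrogateescape")  # b'\xff'

# A surrogate below U+DC80 is outside the mapping, so U+DC2F is never
# turned into the byte 0x2F ("/"); trying to encode it is an error.
try:
    "\udc2f".encode("ascii", "surrogateescape")
    slash_smuggled = True
except UnicodeEncodeError:
    slash_smuggled = False
```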

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 7:11 PM, came the following characters from 
the keyboard of Cameron Simpson:

On 27Apr2009 18:15, Glenn Linderman  wrote:
  

The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.



Why is it necessary that you are able to make this distinction?
  
  
It is necessary that programs (not me) can make the distinction, so 
that  it knows whether or not to do the funny-encoding or not.



I would say this isn't so. It's important that programs know if they're
dealing with strings-for-filenames, but not that they be able to figure
that out "a priori" if handed a bare string (especially since they
can't:-)
  
So you agree they can't... that there are data puns.   (OK, you may not  
have thought that through)



I agree you can't examine a string and know if it came from the os.* munging
or from someone else's munging.

I totally disagree that this is a problem.

There may be puns. So what? Use the right strings for the right purpose
and all will be well.

I think what is missing here, and missing from Martin's PEP, is some
utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

  os.fsdecode(bytes) -> funny-encoded Unicode
    This is what os.listdir() does to produce the strings it hands out.
  os.fsencode(funny-string) -> bytes
    This is what open(filename,..) does to turn the filename into bytes
    for the POSIX open.
  os.pathencode(your-string) -> funny-encoded-Unicode
    This is what you must do to a de novo string to turn it into a
    string suitable for use by open.
    Importantly, for most strings not hand crafted to have weird
    sequences in them, it is a no-op. But it will recode your puns
    for survival.
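
A minimal sketch of the first two functions in use. os.fsdecode() and
os.fsencode() did later appear in the standard library (Python 3.2);
os.pathencode() remains hypothetical and is not shown. The sketch
assumes a POSIX platform, and the file name is made up.

```python
import os

# A file name as a POSIX system call would hand it over; the byte 0xE9
# on its own is not valid UTF-8 (nor ASCII).
raw_name = b"caf\xe9.txt"

# bytes -> str, as os.listdir() does when given a str argument.
text_name = os.fsdecode(raw_name)

# str -> bytes, as open() does before calling into the OS.
roundtrip_ok = os.fsencode(text_name) == raw_name
```

Whatever the locale's encoding is, the byte string recovered at the end
is the one the system supplied at the start.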

and for me, I would like to see:

  os.setfilesystemencoding(coding)

Currently sys.getfilesystemencoding() returns the encoding based on
the current locale, and (I trust) the os.* stuff encodes on that basis.
setfilesystemencoding() would override that, unless coding is None, in
which case it reverts to the former "use the user's current locale" behaviour.
(We have locale "C" for what one might otherwise expect None to mean:-)

The idea here is to let the program control the codec used for filenames
for special purposes, without working indirectly through the locale.

  
If a name is  funny-decoded when the name is accessed by a directory 
listing, it needs  to be funny-encoded in order to open the file.


Hmm. I had thought that legitimate unicode strings already get transcoded
to bytes via the mapping specified by sys.getfilesystemencoding()
(the user's locale). That already happens I believe, and Martin's
scheme doesn't change this. He's just funny-encoding non-decodable byte
sequences, not the decoded stuff that surrounds them.
  
So assume a non-decodable sequence in a name.  That puts us into  
Martin's funny-decode scheme.  His funny-decode scheme produces a bare  
string, indistinguishable from a bare string that would be produced by a  
str API that happens to contain that same sequence.  Data puns.
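
The pun can be made concrete with a short sketch. It assumes the
funny-decoding is the lone-surrogate scheme, spelled here as the
"surrogateescape" error handler of later Python releases; the variable
names are illustrative only.

```python
# A name obtained by funny-decoding the single undecodable byte 0xFF.
from_bytes_api = b"\xff".decode("utf-8", "surrogateescape")

# A "de novo" string that happens to contain the same lone surrogate,
# for example one handed over by some str-based API.
from_str_api = "\udcff"

# The two are the same str value; nothing records where each came from.
is_pun = from_bytes_api == from_str_api

# Encoding back therefore always yields the byte 0xFF, whichever was meant.
encoded = from_str_api.encode("utf-8", "surrogateescape")
```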



See my proposal above. Does it address your concerns? A program still
must know the provenance of the string, and _if_ you're working with
non-decodable sequences in names then you should transmute them into
the funny encoding using the os.pathencode() function described above.

In this way the punning issue can be avoided.

_Lacking_ such a function, your punning concern is valid.
  


Seems like one would also desire os.pathdecode to do the reverse.  And 
also versions that take or produce bytes from funny-encoded strings.


Then, if programs were re-coded to perform these transformations on what 
you call de novo strings, then the scheme would work.


But I think a large part of the incentive for the PEP is to try to 
invent a scheme that intentionally allows for the puns, so that programs 
do not need to be recoded in this manner, and yet still work.  I don't 
think such a scheme exists.


If there is going to be a required transformation from de novo strings 
to funny-encoded strings, then why not make one that people can actually 
see and compare and decode from the displayable form, by using 
displayable characters instead of lone surrogates?



So when open is handed the string, should it open the file with the name  
that matches the string, or the file with the name that funny-decodes to  
the same string?  It can't know, unless it knows that the string is a  
funny-decoded string or not.



True. open() should always expect a funny-encoded name.

  

So it is already the case that strings get decoded to bytes by
cal

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
> Does the PEP take into consideration the normalising behaviour of Mac
> OSX ? We've had some ongoing challenges in bzr related to this with bzr.

No, that's completely out of scope, AFAICT. I don't even know what the
issues are, so I'm not able to propose a solution, at the moment.

Regards,
Martin


Re: [Python-Dev] PEP 383 (again)

2009-04-27 Thread Martin v. Löwis
> PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode
> strings in a reversible way.

That isn't really true; it is not, inherently, about UTF-8.
Instead, it tries to represent non-filesystem-encoding byte sequences
in Unicode strings in a reversible way.

> Quietly escaping a bad UTF-8 encoding with private Unicode characters is
> unlikely to be the right thing

And indeed, the PEP stopped using PUA characters.

> Therefore, when Python encounters path names on a file system
> that are not consistent with the (assumed) encoding for that file
> system, Python should raise an error. 

This is what happens currently, and users are quite unhappy about it.

> If you really don't care what the string looks like and you just want an
> encoding that round-trips without loss, you can probably just set your
> encoding to one of the 8 bit encodings, like ISO 8859-15.   Decoding
> arbitrary byte sequences to unicode strings as ISO 8859-15 is no less
> correct than decoding them as the proposed "utf-8b".  In fact, the most
> likely source of non-UTF-8 sequences is ISO 8859 encodings.

Yes, users can do that (to a degree), but they are still unhappy about
it. The approach actually fails for command line arguments.

> As for what the byte-oriented interfaces should do, they are simply
> platform dependent.  On UNIX, they should do the obvious thing.  On
> Windows, they can either hook up to the low-level byte-oriented system
> calls that the systems supply, or Windows could fake it and have the
> byte-oriented interfaces use UTF-8 encodings always and reject non-UTF-8
> sequences as illegal (there are already many illegal byte sequences
> anyway).

As is, these interfaces are incomplete - they don't support command
line arguments, or environment variables. If you want to complete them,
you should write a PEP.

Regards,
Martin