Re: [Python-Dev] Proposed: drop unnecessary "context" pointer from PyGetSetDef

2009-05-05 Thread Larry Hastings


Mark Dickinson wrote:

This doesn't sound right. The functions in the third party code will get
compiled with the wrong signature, so they can crash (or behave unexpectedly)
when called by Python.


Yes, of course the signature of the getters and setters changes.  Please
ignore me. :-)


If they don't use the closure field, then either they won't compile due 
to type mismatches or they'll work fine.  There's a lot of code in 
CPython that didn't need to be changed for my remove-closure patch; the 
functions didn't bother taking the "void * closure" that they were going 
to ignore anyway, and then they cast the function pointer in the 
PyGetSetDef to make the compiler shut up.  Worked fine.  And, in nearly 
all cases, the static PyGetSetDefs omit the closure member, which means 
C initializes them with a 0.



/larry/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread M.-A. Lemburg
On 2009-05-03 19:39, Martin v. Löwis wrote:
>> If the error handler is supposed to be used for codecs other than utf-8,
>> perhaps it should be renamed something more generic, e.g. "surrogate-escape"?
> 
> Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
> it's an algorithm based on 16-bit or 32-bit code points.

If the error handler doesn't have anything to do with UTF-8, then why
do you use "utf8" in the name?

Please use a more descriptive name for the handler which does not cause
confusion with an existing codec.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Terry Reedy

M.-A. Lemburg wrote:
> On 2009-05-03 19:39, Martin v. Löwis wrote:
>>> If the error handler is supposed to be used for codecs other than utf-8,
>>> perhaps it should be renamed something more generic, e.g. "surrogate-escape"?
>>
>> Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
>> it's an algorithm based on 16-bit or 32-bit code points.
>
> If the error handler doesn't have anything to do with UTF-8, then why
> do you use "utf8" in the name?
>
> Please use a more descriptive name for the handler which does not cause
> confusion with an existing codec.


Having already been confused, I agree.




Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath

2009-05-05 Thread Eric Smith

Mark Hammond wrote:
Is that enough consensus for it to go in?  If so, are there any core 
developers who could help me get it in before the 3.1 feature freeze?  
The patch should be in good shape; it has unit tests and updated 
documentation.


I've taken the liberty of explicitly CCing Martin just in case he missed 
the thread with all the noise regarding PEP 383.


If there are no objections from Martin or anyone else here, please feel 
free to assign it to me (and mail if I haven't taken action by the day 
before the beta freeze...)


Mark: I've reviewed this and it looks okay to me. It passes all the 
tests on Windows and Linux. But if you could take a look at it before 
the release tomorrow, I'd appreciate it.


I feel good enough about it to check it in if no one else gets to it.

Eric.



[Python-Dev] using help function in Py3k

2009-05-05 Thread s|s
Hello,

I ran Python 3.0 for the first time. I used the help() function and
entered "modules hash". It raised an error:

Traceback (most recent call last):
  File "", line 1, in
  File "/home/ss/eproj/xapian/INST//lib/python3.0/site.py", line 427, in __call__
    return pydoc.help(*args, **kwds)
  File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1675, in __call__
    self.interact()
  File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1693, in interact
    self.help(request)
  File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1711, in help
    self.listmodules(request.split()[1])
  File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1799, in listmodules
    apropos(key)
  File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1913, in apropos
    ModuleScanner().run(callback, key, onerror=onerror)
  File "/home/ss/eproj/xapian/INST//lib/python3.0/pydoc.py", line 1875, in run
    source = loader.get_source(modname)
  File "/home/ss/eproj/xapian/INST/lib/python3.0/pkgutil.py", line 293, in get_source
    self.source = self.file.read()
  File "/home/ss/eproj/xapian/INST//lib/python3.0/io.py", line 1720, in read
    decoder = self._decoder or self._get_decoder()
  File "/home/ss/eproj/xapian/INST//lib/python3.0/io.py", line 1506, in _get_decoder
    make_decoder = codecs.getincrementaldecoder(self._encoding)
  File "/home/ss/eproj/xapian/INST//lib/python3.0/codecs.py", line 960, in getincrementaldecoder
    decoder = lookup(encoding).incrementaldecoder
LookupError: unknown encoding: uft-8

The errors occur because the test/ directory, which contains tests for
the Python parser, is installed in the Lib directory. I propose that
these files be installed by default in some other directory,
preferably in the /share or /share/doc part of the tree.
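
The final LookupError can be reproduced in isolation. A minimal sketch, assuming only that the misspelled name "uft-8" from the traceback is what reached the codec registry; codecs.lookup() is the standard registry lookup, and the helper name encoding_is_known is made up for illustration:

```python
import codecs

def encoding_is_known(name):
    """Return True if the codec registry can resolve `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True

print(encoding_is_known("utf-8"))   # a registered codec
print(encoding_is_known("uft-8"))   # the misspelling from the traceback
```

Any source file whose coding declaration names an unregistered codec would presumably fail the same way when its text is decoded.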


regards


-- 
~preet~


Re: [Python-Dev] using help function in Py3k

2009-05-05 Thread Aahz
On Tue, May 05, 2009, s|s wrote:
> 
> I ran Python 3.0 for the first time. I used the help() function and wrote
> "modules hash". It issues an error.

Please file a report on bugs.python.org.
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

"It is easier to optimize correct code than to correct optimized code."
--Bill Harlan


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
M.-A. Lemburg writes:
 > On 2009-05-03 19:39, Martin v. Löwis wrote:
 > >> If the error handler is supposed to be used for codecs other than utf-8,
 > >> perhaps it should be renamed something more generic, e.g. "surrogate-escape"?
 > > 
 > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
 > > it's an algorithm based on 16-bit or 32-bit code points.

I don't understand this phrasing.  The algorithm is only applicable to
ASCII-compatible octet streams.  It results in code points by a simple
displacement of octet -> octet + 0xDC00.  It cannot be used on (say)
UTF-32 to deal with embedded surrogates.

Certainly, the computation requires (at least) 16 bit numbers, but the
input must be restricted to a stream of 8-bit code points, while the
output is 16- or 32-bit code points.
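
The displacement described above can be checked directly. A small sketch, assuming a Python 3.1 or later interpreter, where the handler under discussion shipped under the name 'surrogateescape' (the name was still being debated in this thread):

```python
raw = b"abc\xff"                      # 0xFF can never appear in valid UTF-8

text = raw.decode("utf-8", "surrogateescape")

# The undecodable octet becomes the code point octet + 0xDC00.
escaped = ord(text[-1])
print(hex(escaped))
assert escaped == 0xDC00 + 0xFF

# ASCII bytes decode to themselves and are never displaced.
assert text[:3] == "abc"
```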

 > Please use a more descriptive name [than "utf-8b"] for the handler
 > which does not cause confusion with an existing codec.

But please don't use "surrogate-escape" or (as in the current PEP)
"python-escape"; it's not an escaping (quotation) mechanism.
"surrogate-replace", "surrogate-substitute", or "surrogate-translate"
would be better names.


Re: [Python-Dev] using help function in Py3k

2009-05-05 Thread Daniel Stutzbach
On Tue, May 5, 2009 at 5:41 AM, s|s wrote:

> LookupError: unknown encoding: uft-8
>

uft-8?

Looks like a variation of Issue 4540 (or
a duplicate?  I can't tell)

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 


[Python-Dev] [Fwd: [Python-checkins] r72331 - python/branches/py3k/Modules/posixmodule.c]

2009-05-05 Thread Eric Smith
Modules/posixmodule.c now compiles for me, but I get a Bus Error in 
test_lchflags when running test_posixmodule on Mac OS X 10.5. I'll open 
a release blocker bug on this.


 Original Message 
Subject: [Python-checkins] r72331 - 
python/branches/py3k/Modules/posixmodule.c

Date: Tue,  5 May 2009 15:07:31 +0200 (CEST)
From: eric.smith 
To: python-check...@python.org

Author: eric.smith
Date: Tue May  5 15:07:30 2009
New Revision: 72331

Log:
Added missing semicolon.

Modified:
   python/branches/py3k/Modules/posixmodule.c

Modified: python/branches/py3k/Modules/posixmodule.c
==
--- python/branches/py3k/Modules/posixmodule.c  (original)
+++ python/branches/py3k/Modules/posixmodule.c  Tue May  5 15:07:30 2009
@@ -1928,7 +1928,7 @@
if (!PyArg_ParseTuple(args, "O&i:lchmod", PyUnicode_FSConverter,
  &opath, &i))
return NULL;
-   path = bytes2str(opath, 1)
+   path = bytes2str(opath, 1);
Py_BEGIN_ALLOW_THREADS
res = lchmod(path, i);
Py_END_ALLOW_THREADS
___
Python-checkins mailing list
python-check...@python.org
http://mail.python.org/mailman/listinfo/python-checkins



[Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > I've updated the PEP accordingly.

I have three substantive comments.  First, although consequences for
Python 3 byte interfaces (ie, "none") are explicitly stated, as far as
I can see this PEP could apply to Python 2 as well.  I don't think
it's intended that way.  Either way, I think you should clarify that
point.

Second, I suggest "surrogate-replace" as the name of the error handler
rather than "utf8b".  (Elsewhere I've suggested others, but I think
this is the best of the bunch.)

Third, it is not clear to me why non-decodable ASCII should be an
error.  There are plenty of low surrogates for the purpose.  Is there
another technical reason?  Stupid or not, Shift-JIS- and Big5-encoded
file systems are quite common in Asia still (including non-rewritable
media).  I think surrogate-replacement of ASCII should at least be an
option.

I don't think "people shouldn't be using non-ASCII-compatible
encodings for locale encodings" is a sufficient rationale for a hard
error here.  I mean, of course they *should* be using UTF-8.  Maybe
Python 3.1 should just go ahead and error on any other encoding on
POSIX platforms? 

I have a number of nitpicking comments and technical clarifications on
the PEP.  Rationale is in footnotes.  There were also a few typos I
noticed.

1.  There is no such thing as a "half-surrogate" in Unicode.  "Lone
surrogate" is clear enough.  Or for somewhat fancier English,
"isolated surrogate" or "non-syntactic surrogate".  To emphasize
that Python codecs will only produce them in contexts where a
Unicode character or high surrogate (for UTF-16 Python) is
syntactically required, "isolated low surrogate" or "isolated
trailing surrogate" might be good.[1]

2.  The specification should state, and the discussion emphasize, that
strings which were produced by surrogate replacement *must not* be
used in data interchange with systems that do not specifically
accept such strings, and that this is the responsibility of the
application.[2]

Rather than saying that "dealing with such conflicts is out of
scope of this PEP", I would say

"""Dealing with such conflicts is the responsibility of the
application.  Since this PEP's mechanism produces valid Unicode
where possible, and produces *invalid* code points only via the
error handler, one strategy is for the application to validate all
other sources of strings as Unicode conforming.  There may be
other useful application-specific strategies, as well."""

3.  In the discussion, the transition from the example of alternative
use of 'python-escape' to discussion of the error handler
interface extension is a bit abrupt.  I suggest rewriting as:

"""The extension to the encode error handler interface proposed by
this PEP is necessary to implement the 'utf8b' error handler,
because there are required byte sequences which cannot be
generated from replacement Unicode.  However, the encode error
handler interface presently requires replacement Unicode to be
provided in lieu of the non-encodable Unicode from the source
string.  Then it promptly encodes that replacement Unicode.  In
some error handlers, such as the 'utf8b' proposed here, it is also
simpler and more efficient for the error handler to provide a
pre-encoded replacement byte string, rather than forcing it to
calculating Unicode from which the encoder would create the
desired bytes."""

Typos (line references are to pep-0383.txt svn r72332):

l.  86: "Byte-orientied" -> "Byte-oriented"
l.  98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
l. 130: "provide" -> "provided"
l. 134: "calculating" -> "calculate"


Footnotes: 
[1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
once, in section 16.6, but the context is such that I take it to
refer to "half of the surrogate area".  Section 3.8 doesn't use
these, instead noting that "leading" and "trailing" are sometimes
used instead of "high" and "low".  Better to avoid the word "half"
in PEP 383, I think.

[2] Since this error handler is going to be the default for POSIX I/O,
of course people are going to mostly ignore that restriction.  The
point is, passing such strings to systems that don't expect them
is a bug, and the PEP should make it clear that it's the app's
bug, not the other system's.  On the other hand, using those
strings in a context of consenting adults (and I do mean
double-opt-in here) is perfectly acceptable.  I'm specifically
thinking of use in the Tahoe protocol discussed by Zooko
O'Whielacronx; it may not be usable there for backward
compatibility reasons, but "Unicode conformance" is not an issue
in principle.

This does imply that programs that take advantage of the error
handler specified in this PEP are on their own if they accept data
from any sources that are not known to be Unicode-con

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Zooko O'Whielacronx
On Tue, May 5, 2009 at 8:57 AM, Stephen J. Turnbull wrote:
>
> 2.  The specification should state, and the discussion emphasize, that
>    strings which were produced by surrogate replacement *must not* be
>    used in data interchange with systems that do not specifically
>    accept such strings, and that this is the responsibility of the
>    application.[2]

That sounds like a useful statement to make.  How would an application
make sure that it was producing only valid Unicode?  How about adding
an option to os.listdir() named "errors" with default value 'utf8b'
(or 'surrogate-replace', or whatever the name is)?  Then applications
which need to produce only valid Unicode strings could pass
errors='strict', errors='ignore', or errors='replace'.  (If anyone really
wants behavior like Python 3.0 then we could perhaps also add a new
one just for os.listdir() named errors='skipfilename'.)

My most recent plan for Tahoe, as of the letter that I sent last
night, is to emulate the behavior of Nautilus and GNU ls by using the
'replace' error handler and (emulating Nautilus) to append " (invalid
encoding)" to the end of the string.  (screenshot:
http://zooko.com/Nautilus_vs_invalid_encoding.png )

So if I could ask os.listdir to return filenames with U+FFFD in place
of undecodable characters, then I could subsequently do something
like:

    for f in os.listdir(d, errors='replace'):
        if u"\ufffd" in f:
            f += " (invalid encoding)"

(On top of that I would have to check for collisions, but that's out of scope.)
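
As far as I know, os.listdir() has no "errors" argument in any released Python, so the loop above is a proposal rather than working code. A rough equivalent that runs on Python 3.2 or later, assuming a bytes directory path (which makes os.listdir() return undecoded bytes names); the helper names display_name and listdir_for_display are hypothetical:

```python
import os
import sys

def display_name(raw_name, encoding=None):
    """Decode a bytes filename for display, flagging undecodable names."""
    encoding = encoding or sys.getfilesystemencoding()
    text = raw_name.decode(encoding, "replace")
    if "\ufffd" in text:
        text += " (invalid encoding)"
    return text

def listdir_for_display(directory):
    """Yield (raw_name, shown_name) pairs; raw_name can still open the file."""
    for raw_name in os.listdir(os.fsencode(directory)):
        yield raw_name, display_name(raw_name)
```

Keeping the raw bytes next to the display string is a design choice of this sketch: the display string alone is lossy. A name that legitimately contains U+FFFD would also be flagged, which this sketch does not try to distinguish.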

Regards,

Zooko


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread MRAB

Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
>
>  > I've updated the PEP accordingly.
>
> I have three substantive comments.  First, although consequences for
> Python 3 byte interfaces (ie, "none") are explicitly stated, as far as
> I can see this PEP could apply to Python 2 as well.  I don't think
> it's intended that way.  Either way, I think you should clarify that
> point.
>
> Second, I suggest "surrogate-replace" as the name of the error handler
> rather than "utf8b".  (Elsewhere I've suggested others, but I think
> this is the best of the bunch.)

+1

> Third, it is not clear to me why non-decodable ASCII should be an
> error.  There are plenty of low surrogates for the purpose.  Is there
> another technical reason?  Stupid or not, Shift-JIS- and Big5-encoded
> file systems are quite common in Asia still (including non-rewritable
> media).  I think surrogate-replacement of ASCII should at least be an
> option.
>
> I don't think "people shouldn't be using non-ASCII-compatible
> encodings for locale encodings" is a sufficient rationale for a hard
> error here.  I mean, of course they *should* be using UTF-8.  Maybe
> Python 3.1 should just go ahead and error on any other encoding on
> POSIX platforms?

I don't see why the error handler couldn't in principle be used with
encodings other than UTF-8, although in that case all of the low
surrogates should be open to use.

> I have a number of nitpicking comments and technical clarifications on
> the PEP.  Rationale is in footnotes.  There were also a few typos I
> noticed.
>
> 1.  There is no such thing as a "half-surrogate" in Unicode.  "Lone
> surrogate" is clear enough.  Or for somewhat fancier English,
> "isolated surrogate" or "non-syntactic surrogate".  To emphasize
> that Python codecs will only produce them in contexts where a
> Unicode character or high surrogate (for UTF-16 Python) is
> syntactically required, "isolated low surrogate" or "isolated
> trailing surrogate" might be good.[1]
>
> 2.  The specification should state, and the discussion emphasize, that
> strings which were produced by surrogate replacement *must not* be
> used in data interchange with systems that do not specifically
> accept such strings, and that this is the responsibility of the
> application.[2]
>
> Rather than saying that "dealing with such conflicts is out of
> scope of this PEP", I would say
>
> """Dealing with such conflicts is the responsibility of the
> application.  Since this PEP's mechanism produces valid Unicode
> where possible, and produces *invalid* code points only via the
> error handler, one strategy is for the application to validate all
> other sources of strings as Unicode conforming.  There may be
> other useful application-specific strategies, as well."""
>
> 3.  In the discussion, the transition from the example of alternative
> use of 'python-escape' to discussion of the error handler
> interface extension is a bit abrupt.  I suggest rewriting as:
>
> """The extension to the encode error handler interface proposed by
> this PEP is necessary to implement the 'utf8b' error handler,
> because there are required byte sequences which cannot be
> generated from replacement Unicode.  However, the encode error
> handler interface presently requires replacement Unicode to be
> provided in lieu of the non-encodable Unicode from the source
> string.  Then it promptly encodes that replacement Unicode.  In
> some error handlers, such as the 'utf8b' proposed here, it is also
> simpler and more efficient for the error handler to provide a
> pre-encoded replacement byte string, rather than forcing it to
> calculating Unicode from which the encoder would create the
> desired bytes."""
>
> Typos (line references are to pep-0383.txt svn r72332):
>
> l.  86: "Byte-orientied" -> "Byte-oriented"
> l.  98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
> l. 130: "provide" -> "provided"
> l. 134: "calculating" -> "calculate"
>
> Footnotes:
> [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
> once, in section 16.6, but the context is such that I take it to
> refer to "half of the surrogate area".  Section 3.8 doesn't use
> these, instead noting that "leading" and "trailing" are sometimes
> used instead of "high" and "low".  Better to avoid the word "half"
> in PEP 383, I think.

"Leading" and "trailing" simply state the order, not the set ("high" or
"low"), so are not good terms to use.

> [2] Since this error handler is going to be the default for POSIX I/O,
> of course people are going to mostly ignore that restriction.  The
> point is, passing such strings to systems that don't expect them
> is a bug, and the PEP should make it clear that it's the app's
> bug, not the other system's.  On the other hand, using those
> strings in a context of consenting adults (and I do mean
> double-opt-in here) is perfectly acceptable.  I'm specifically
> thinking of use in the Tahoe protocol discussed by Zooko
> O'Whielacronx; it

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
Zooko O'Whielacronx writes:

 > How would an application make sure that they were producing only
 > valid unicode?

That's very difficult.  There are several sources of invalid Unicode
that I can think of in Python: C modules, chr(), \u literals, and now
codecs with the 'utf8b' error handler.  There may be others.  You'd
need to review your own code for all of them very carefully, and you'd
have to validate all strings returned by non-validating APIs (which is
all of them in Python now, although many of them can probably be
trusted, such as codecs not using the 'utf8b' error handler).
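
The validation step mentioned above might look something like the following. This is only a sketch, and it assumes Python 3.3 or later, where a str holds whole code points, so any surrogate found in a string is a lone one; the function names are hypothetical:

```python
def lone_surrogates(text):
    """Return (index, code point) for every surrogate code point in `text`."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if 0xD800 <= ord(ch) <= 0xDFFF]

def is_interchange_safe(text):
    """True if `text` contains no surrogate code points at all."""
    return not lone_surrogates(text)

print(is_interchange_safe("plain text"))
print(lone_surrogates("a\udcffb"))
```

On a UTF-16 ("narrow") build of the era, a check like this would also have to pair up leading and trailing surrogates before calling anything invalid.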

 > How about add an option to os.listdir() named "errors" with default
 > value 'utf8b'

Seems reasonable to me, but Martin's probably thought more carefully
about it.  I don't think it's applicable to your use case, though,
because you want to be able to *access* those files as well as display
the names to the users, right?  You won't be able to access those
files if you receive the names already munged by the error handler.
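
The access problem can be seen by comparing the two handlers on one name. A minimal sketch, assuming Python 3.1 or later, where the PEP 383 handler is spelled 'surrogateescape'; the filename is made up:

```python
raw = b"caf\xe9.txt"    # Latin-1 bytes, not valid UTF-8

replaced = raw.decode("utf-8", "replace")
escaped = raw.decode("utf-8", "surrogateescape")

# 'replace' is lossy: the original byte cannot be recovered from U+FFFD.
assert replaced.encode("utf-8") != raw

# The escaping handler round-trips back to the exact bytes on disk.
assert escaped.encode("utf-8", "surrogateescape") == raw
```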






Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
MRAB writes:

 > > I don't think "people shouldn't be using non-ASCII-compatible
 > > encodings for locale encodings" is a sufficient rationale for a hard
 > > error here.  I mean, of course they *should* be using UTF-8.  Maybe
 > > Python 3.1 should just go ahead and error on any other encoding on
 > > POSIX platforms? 
 > > 
 > I don't see why the error handler couldn't in principle be used with
 > encodings other than UTF-8, although in that case all of the low
 > surrogates should be open to use.

I should have been more clear here, I guess.  The error handler *can*,
and in the PEP *will be* by default, used with all "sane" locale
encodings on POSIX.

It occurs to me that the PEP maybe should say that it is an error
to have your POSIX locale set to UTF-16 or something like that.

What "sane" means in this context is

1.  ASCII NUL is the bytearray terminator, and can't be used as a byte
in a file name.  This rules out UTF-16, UTF-32, and widechar EUC
encodings, as well as some very rare ones.

2.  An ASCII character always translates to the Unicode character with
the same code (ie, "to itself").  It is not a part of other
sequences (control sequences, or a trailing byte).  This rules out
EBCDIC, ISO-2022-*, Shift JIS, and Big5, among the encodings I'm
familiar with.  EBCDIC because only by accident will an EBCDIC
character map to the same ASCII character with the same code.  The
ISO-2022-* encodings are out because ASCII characters are used in
escape sequences.  Shift JIS and Big5 because in those encodings,
a high-bit-set octet signals the start of a multibyte sequence,
and some of the trailing bytes may be in the ASCII range.

What's left?  Well, UTF-8, all of the ISO-8859 sets, several national
standards (such as the KOI8 family for Cyrillic), IBM and Microsoft
"code pages", and the "packed" EUC encodings used for Japanese,
Chinese, and Korean.  These all have the property that ASCII is
ASCII, and all non-ASCII characters are encoded using only
high-bit-set octets.  In practice, on Unix these are invariably what
you encounter.
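
The second criterion can be probed with Python's own codecs. This sketch checks a single sample character rather than a whole encoding, and the function name is hypothetical; 'shift_jis' and 'euc_jp' are the standard library's codec names:

```python
def ascii_bytes_from_non_ascii(text, encoding):
    """Collect ASCII-range bytes that `encoding` emits for non-ASCII characters."""
    found = set()
    for ch in text:
        if ord(ch) >= 0x80:
            found.update(b for b in ch.encode(encoding) if b < 0x80)
    return found

sample = "\u8868"   # a common kanji whose Shift JIS trailing byte is 0x5C

# 0x5C is the ASCII backslash, so Shift JIS fails the "sane" test here.
print(ascii_bytes_from_non_ascii(sample, "shift_jis"))
# The packed EUC and UTF-8 forms use only high-bit-set octets.
print(ascii_bytes_from_non_ascii(sample, "euc_jp"))
print(ascii_bytes_from_non_ascii(sample, "utf-8"))
```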

So what's the problem?  Backward compatibility for Microsoft OSes,
which not only used to use MBCS national character sets, but
"cleverly" packed more characters into the encoding by using ASCII as
trailing bytes.  Ie, the aforementioned "insane" Shift JIS (which is
mandated by the leading Japanese cellphone service provider even
today) and Big5 (the leading encoding for Chinese until very
recently).  These are very commonly found on archival media, and even
on USB keys and so on which tend to be FAT-formatted.  This doesn't
prevent usage of the Unicode APIs, but up to Windows 2000 most
Japanese vendors' OEM version of Windows used FAT format and Shift JIS
as the file system encoding, and I know of Japanese offices where
Windows 98 systems were in use as recently as early 2007.

It's the removable media which are the problem, because on Windows you
just use the Unicode APIs.  But they're not available on Unix, so you
need the byte-oriented APIs.

Is this a real problem?  I don't know, I don't do Windows, I don't do
computing with my cellphone, and I don't need to get Japanese (that
might be mixed with Russian ones!!) filenames off of ancient media or
CIFS fileshares using Shift JIS.  I guess it's possible that
cellphones do everything *except* add filenames to directories in
Shift JIS, but the filenames are in UTF-16.

OTOH, it seems to me that an *optional* extension to handle errors on
ASCII bytes is technically feasible and would be nearly trivial to add
to the PEP.  The biggest cost would be adding the error argument to
various functions (as Zooko requested) so that
surrogate-replace-extended could be specified if needed.

 > > Footnotes: 
 > > [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
 > > once, in section 16.6, but the context is such that I take it to
 > > refer to "half of the surrogate area".  Section 3.8 doesn't use
 > > these, instead noting that "leading" and "trailing" are sometimes
 > > used instead of "high" and "low".  Better to avoid the word "half"
 > > in PEP 383, I think.
 > > 
 > "Leading" and "trailing" simply state the order, not the set ("high" or
 > "low"), so are not good terms to use.

But it's the order that's important.  If you've just finished reading
a character, and encounter a trailing surrogate, then it was produced
by the 'utf8b' error handler; nothing else in a Python codec can do
that.  If you've just finished reading a character, are in a UTF-16
Python, and encounter a leading surrogate, then you immediately gobble
the following code, which must be a trailing surrogate, and combine
them to produce a character.  The remaining case is that you encounter
a valid character.  Anything else is an error, and (assuming no bugs),
no Python codec will produce anything else.
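
The three cases above can be written out as a small scanner over 16-bit code units. This is an illustration of the argument, not code from any Python codec; the function name is hypothetical, and it assumes the handler only ever emits U+DC80..U+DCFF (that is, only non-ASCII bytes are escaped):

```python
def classify(units):
    """Split 16-bit code units into characters and handler-escaped bytes."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xDC80 <= u <= 0xDCFF:
            # Trailing surrogate where a character was expected:
            # only the error handler produces this.
            out.append(("escaped_byte", u - 0xDC00))
            i += 1
        elif 0xD800 <= u <= 0xDBFF:
            # Leading surrogate: the next unit must be a trailing surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("leading surrogate without trailing surrogate")
            lo = units[i + 1]
            out.append(("char", 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Any other isolated trailing surrogate is an error in this sketch.
            raise ValueError("unexpected trailing surrogate")
        else:
            out.append(("char", u))
            i += 1
    return out

print(classify([0x41, 0xD83D, 0xDE00, 0xDCFF]))
```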

 > > This does imply that programs that take advantage of the error
 >

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread MRAB

Stephen J. Turnbull wrote:
> MRAB writes:
>
>  > > I don't think "people shouldn't be using non-ASCII-compatible
>  > > encodings for locale encodings" is a sufficient rationale for a hard
>  > > error here.  I mean, of course they *should* be using UTF-8.  Maybe
>  > > Python 3.1 should just go ahead and error on any other encoding on
>  > > POSIX platforms?
>  > >
>  > I don't see why the error handler couldn't in principle be used with
>  > encodings other than UTF-8, although in that case all of the low
>  > surrogates should be open to use.
>
> I should have been more clear here, I guess.  The error handler *can*,
> and in the PEP *will be* by default, used with all "sane" locale
> encodings on POSIX.
>
> It occurs to me that the PEP maybe should say that it is an error
> to have your POSIX locale set to UTF-16 or something like that.
>
> What "sane" means in this context is
>
> 1.  ASCII NUL is the bytearray terminator, and can't be used as a byte
> in a file name.  This rules out UTF-16, UTF-32, and widechar EUC
> encodings, as well as some very rare ones.


[snip]
It might be slightly OT, but sometimes strict UTF-8 encoding is violated
by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as
a terminator. I think I read that Microsoft sometimes does this.


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
MRAB writes:

 > [snip]
 > It might be slightly OT, but sometimes strict UTF-8 encoding is violated
 > by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as
 > a terminator. I think I read that Microsoft sometimes does this.

Nice hack! as long as you don't let it escape.  But if 'strict' errors
on this, then PEP 383 'utf8b' will do the right thing, I think.
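
That expectation can be checked in a later Python. A short sketch, assuming Python 3.1 or later, where the handler is named 'surrogateescape': the strict UTF-8 decoder rejects the overlong form, and the escaping handler carries both bytes through unchanged.

```python
data = b"a\xc0\x80b"    # overlong two-byte form of NUL between 'a' and 'b'

try:
    data.decode("utf-8")             # 'strict' is the default error handler
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

escaped = data.decode("utf-8", "surrogateescape")

# Each offending byte is escaped on its own, and encoding restores them.
assert escaped.encode("utf-8", "surrogateescape") == data
print(strict_ok, [hex(ord(c)) for c in escaped])
```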


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Lino Mastrodomenico
2009/5/5 Stephen J. Turnbull:
> Third, it is not clear to me why non-decodable ASCII should be an
> error.

The PEP originally allowed the conversion to U+DCxx of bytes below 128
that cannot be decoded by the encoding used, but this creates
potential security problems.

See: 

-- 
Lino Mastrodomenico


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis
>  > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 -
>  > > it's an algorithm based on 16-bit or 32-bit code points.
> 
> I don't understand this phrasing.  The algorithm is only applicable to
> ASCII-compatible octet streams.  It results in code points by a simple
> displacement of octet -> octet + 0xDC00.  It cannot be used on (say)
> UTF-32 to deal with embedded surrogates.
> 
> Certainly, the computation requires (at least) 16 bit numbers, but the
> input must be restricted to a stream of 8-bit code points, while the
> output is 16- or 32-bit code points.

Right - the algorithm maps between bytes and 16/32-bit code units.
It works, in particular, for UTF-8, and was originally proposed to apply
to UTF-8 - but it can work in any other place that converts bytes to
16/32-bit code units as well.

Regards,
Martin
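[Editorial illustration, not part of the original message.] The displacement described above (byte -> U+DC00 + byte) can be sketched as follows. The helper `escape_byte` is hypothetical and exists only to show the arithmetic; the handler name 'surrogateescape' is the one used in released Python, whereas the thread calls it 'utf8b'. The same mapping is applied whether the codec is UTF-8 or something else, such as ASCII.

```python
def escape_byte(byte):
    """Hypothetical helper: map one undecodable byte to a lone low surrogate."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are escaped")
    return chr(0xDC00 + byte)


raw = b"caf\xe9"  # Latin-1 bytes; invalid as ASCII and as UTF-8

# The handler is independent of the codec that reports the error.
via_ascii = raw.decode("ascii", "surrogateescape")
via_utf8 = raw.decode("utf-8", "surrogateescape")

# The same string, built by hand from the displacement rule.
by_hand = "caf" + escape_byte(0xE9)

# Encoding with the handler reverses the displacement.
restored = via_ascii.encode("ascii", "surrogateescape")
```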


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis
> I have three substantive comments.  First, although consequences for
> Python 3 byte interfaces (ie, "none") are explicitly stated, as far as
> I can see this PEP could apply to Python 2 as well.  I don't think
> it's intended that way.  Either way, I think you should clarify that
> point.

Done: the Python-Version header already clarifies that point.

> Second, I suggest "surrogate-replace" as the name of the error handler
> rather than "utf8b".

I think this is bike-shedding.

> Third, it is not clear to me why non-decodable ASCII should be an
> error.  There are plenty of low surrogates for the purpose.  Is there
> another technical reason?  Stupid or not, Shift-JIS- and Big5-encoded
> file systems are quite common in Asia still (including non-rewritable
> media).  I think surrogate-replacement of ASCII should at least be an
> option.

It's a security risk. If U+DCXX would map to \xXX, then somebody could
embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
sanitized, nobody would expect that this will actually access ../
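[Editorial illustration, not part of the original message.] A minimal sketch of the attack described above, assuming a hypothetical encoder `naive_unescape` that maps every U+DCxx back to byte xx, including the ASCII range. A string check for "../" passes, yet the encoded bytes contain it. The last part shows that the handler as released in Python ('surrogateescape') refuses to encode U+DC2E for exactly this reason.

```python
def naive_unescape(text):
    """Hypothetical, UNSAFE encoder: maps every U+DC00..U+DCFF to a raw byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


smuggled = "\udc2e\udc2e\udc2f" + "etc/passwd"

# A sanitizer that looks for "../" in the character string sees nothing.
looks_clean = "../" not in smuggled

# The unsafe mapping then produces the traversal sequence at the byte level.
unsafe_bytes = naive_unescape(smuggled)

# The real handler only accepts U+DC80..U+DCFF and raises for U+DC2E.
try:
    smuggled.encode("utf-8", "surrogateescape")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```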

> 1.  There is no such thing as a "half-surrogate" in Unicode.  "Lone
> surrogate" is clear enough.  Or for somewhat fancier English,
> "isolated surrogate" or "non-syntactic surrogate".  To emphasize
> that Python codecs will only produce them in contexts where a
> Unicode character or high surrogate (for UTF-16 Python) is
> syntactically required, "isolated low surrogate" or "isolated
> trailing surrogate" might be good.[1]

Fixed. I removed the word "half" everywhere. It really doesn't mean
anything to me (it could have been called sunnygate instead, making
no difference).

I tried to understand "surrogate", and it was explained to me that
"surrogate" is something that stands for something - but then I
would argue that the two subsequent codes form a surrogate - they
stand for something else. The individual surrogate code (in Unicode
terminology) doesn't stand for anything. So don't you agree that
it is the Unicode terminology that is in error, not the PEP?

> 2.  The specification should state, and the discussion emphasize, that
> strings which were produced by surrogate replacement *must not* be
> used in data interchange with systems that do not specifically
> accept such strings, and that this is the responsibility of the
> application.[2]

No. The specification puts no requirements on applications whatsoever.
So if you propose to use MUST NOT in the RFC 2119 sense, I strongly
disagree.

Applications that desire mojibake are free to produce it; we are
consenting adults; and all that.

> 3.  In the discussion, the transition from the example of alternative
> use of 'python-escape' to discussion of the error handler
> interface extension is a bit abrupt.  I suggest rewriting as:
> 
> """The extension to the encode error handler interface proposed by
> this PEP is necessary to implement the 'utf8b' error handler,
> because there are required byte sequences which cannot be
> generated from replacement Unicode.  However, the encode error
> handler interface presently requires replacement Unicode to be
> provided in lieu of the non-encodable Unicode from the source
> string.  Then it promptly encodes that replacement Unicode.  In
> some error handlers, such as the 'utf8b' proposed here, it is also
> simpler and more efficient for the error handler to provide a
> pre-encoded replacement byte string, rather than forcing it to
> calculate Unicode from which the encoder would create the
> desired bytes."""

Unfortunately, I failed to understand where you want this text to
go. What paragraphs should I remove, or (if none), after which
paragraph should I insert this text?

Regards,
Martin
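[Editorial illustration, not part of the original message.] The interface extension discussed above, an encode error handler that returns pre-encoded bytes instead of replacement Unicode, can be sketched with `codecs.register_error`. The handler name "demo-byte-escape" and the function `demo_byte_escape` are made up for this example; the sketch assumes a Python 3 version in which encode error handlers may return bytes.

```python
import codecs


def demo_byte_escape(exc):
    """Hypothetical encode error handler that returns raw bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        # Only escaped non-ASCII bytes (U+DC80..U+DCFF) are reversible.
        if not 0xDC80 <= cp <= 0xDCFF:
            raise exc
        out.append(cp - 0xDC00)
    # Returning bytes tells the encoder to copy them to the output as-is.
    return bytes(out), exc.end


codecs.register_error("demo-byte-escape", demo_byte_escape)

encoded = "caf\udce9".encode("utf-8", "demo-byte-escape")
```

No replacement Unicode string could make the UTF-8 encoder emit the lone byte 0xE9, which is why the handler has to supply bytes directly.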


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis
> It occurs to me that the PEP maybe should say that it is an error
> to have your POSIX locale set to UTF-16 or something like that.

No. It is *impossible* to have UTF-16 as the locale character set,
not an error. Your statement is like saying "it is an error to
breathe in the vacuum".

In any case, the discussion says

# Encodings that are not compatible with ASCII are not supported by
# this specification; bytes in the ASCII range that fail to decode
# will cause an exception. It is widely agreed that such encodings
# should not be used as locale charsets.

Regards,
Martin
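[Editorial illustration, not part of the original message.] The rule quoted above, that bytes in the ASCII range which fail to decode cause an exception, can be observed by calling the handler directly. This sketch uses the handler under its released name 'surrogateescape' and constructs the `UnicodeDecodeError` objects by hand purely for demonstration; the reason strings are arbitrary.

```python
import codecs

handler = codecs.lookup_error("surrogateescape")

# A byte >= 0x80 is escaped as U+DC00 + byte.
high = UnicodeDecodeError("ascii", b"\xe9", 0, 1, "demo: undecodable byte")
replacement, resume_at = handler(high)

# A byte < 0x80 is never escaped; the handler re-raises the error instead.
low = UnicodeDecodeError("ascii", b"/", 0, 1, "demo: pretend this byte failed")
try:
    handler(low)
    ascii_refused = False
except UnicodeDecodeError:
    ascii_refused = True
```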


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread M.-A. Lemburg
Martin v. Löwis wrote:
>> I have three substantive comments.  First, although consequences for
>> Python 3 byte interfaces (ie, "none") are explicitly stated, as far as
>> I can see this PEP could apply to Python 2 as well.  I don't think
>> it's intended that way.  Either way, I think you should clarify that
>> point.
> 
> Done: the Python-Version header already clarifies that point.
> 
>> Second, I suggest "surrogate-replace" as the name of the error handler
>> rather than "utf8b".
> 
> I think this is bike-shedding.

The name "utf8b" suggested in the PEP is not in line with the codec
design and causes confusion with an existing codec of a similar name.

Error handlers and codecs are two different things, so the namespaces
need to be clearly separate.

Please change the name of the error handler to a different name that
does not resemble or cause confusion with a codec name and fits the
scheme of error handler names we already have in place in Python for
replacing error handlers, i.e. "XYZreplace".

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
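[Editorial illustration, not part of the original message.] A short sketch of the two namespaces mentioned above: error handlers are looked up with `codecs.lookup_error`, codecs with `codecs.lookup`. The handler names listed are the standard ones in Python, several of which follow the "XYZreplace" scheme.

```python
import codecs

# Standard error handler names; note the "...replace" naming pattern.
handler_names = ["strict", "ignore", "replace",
                 "xmlcharrefreplace", "backslashreplace"]
handlers = {name: codecs.lookup_error(name) for name in handler_names}

text = "caf\u00e9"
as_xml = text.encode("ascii", "xmlcharrefreplace")
as_backslash = text.encode("ascii", "backslashreplace")

# An error handler name is not a codec name: the codec registry rejects it.
try:
    codecs.lookup("xmlcharrefreplace")
    is_codec = True
except LookupError:
    is_codec = False
```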



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:
 > > It occurs to me that the PEP maybe should say that it is an error
 > > to have your POSIX locale set to UTF-16 or something like that.
 > 
 > No. It is *impossible* to have UTF-16 as the locale character set,
 > not an error. Your statement is like saying "it is an error to
 > breathe in the vacuum".

I realize this is not useful, so maybe you don't need to mention it.
However, it certainly is possible to set LANG with an absurd, or
merely dangerous, encoding.

 > In any case, the discussion says
 > 
 > # Encodings that are not compatible with ASCII are not supported by
 > # this specification; bytes in the ASCII range that fail to decode
 > # will cause an exception. It is widely agreed that such encodings
 > # should not be used as locale charsets.

Which is your excuse for not supporting Shift JIS fully.  It doesn't
stop people from setting LC_ALL=ja_JP.shift_jis, or using Shift JIS as
the default encoding for certain media.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
Lino Mastrodomenico writes:
 > 2009/5/5 Stephen J. Turnbull :
 > > Third, it is not clear to me why non-decodable ASCII should be an
 > > error.
 > 
 > The PEP originally allowed the conversion to U+DCxx of bytes below 128
 > that cannot be decoded by the encoding used, but this creates
 > potential security problems.
 > 
 > See: 

Yeah, yeah, this is the same old same old from PEP 3131.  Anything
that handles the various attacks based on ASCII-alike characters
should at least rule out invalid Unicode, too!

And where is this U+DC2F supposed to be coming from, anyway?  The
user's *local* environment or the user's *local* filesystem!  Codecs
not using 'utf8b' can't produce it, so the only other cases are chr()
and \u literals in the *local* process, or an already broken module in
your code.  I really can't imagine that any sane programmer these days
would be using 'utf8b' on bytes received from the Internet!

Of course I can't prove that there's no vector for an exploit here (in
fact, I'm sure there is one with sufficiently careless handling of
input), but I think "consenting adults" covers the Shift JIS use case.
Make it an option, but it should be explicitly part of the PEP.


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > Done: the Python-Version header already clarifies that point.

Ah, OK.  I wish my day job required reading more PEPs so I'd be more
familiar with these formalities. :-)

 > > Second, I suggest "surrogate-replace" as the name of the error handler
 > > rather than "utf8b".
 > 
 > I think this is bike-shedding.

I don't personally care (I already was aware of UTF-8B), but there are
plenty of others who do.  I think that's a good name to make
Marc-Andre and Terry happier.  You have to fix the existing uses of
the obsolete "python-escape", anyway.

 > It's a security risk. If U+DCXX would map to \xXX, then somebody could
 > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
 > sanitized, nobody would expect that this will actually access ../

The odds that anybody will actually take notice of U+002E U+002E
U+002F in a string are sufficiently small that any number of exploits
have already been based on it.  I agree that there is some additional
risk from this if people make the check for "../" before they prepend
"\udc2e\udc2e\udc2f", but I think that risk is very small compared to
the pain of having an error handler whose raison d'etre is to not raise
exceptions go ahead and raise them anyway.

See also my reply to Lino Mastrodomenico.  Again, an option is good
enough for my purposes as long as interfaces for os.listdir() and the
like support setting the error handler (cf. Zooko's proposal), but I
think the option should be available.

 > I tried to understand "surrogate", and it was explained to me that
 > "surrogate" is something that stands for something - but then I
 > would argue that the two subsequent codes form a surrogate - they
 > stand for something else. The individual surrogate code (in Unicode
 > terminology) doesn't stand for anything. So don't you agree that
 > it is the Unicode terminology that is in error, not the PEP?

Plausibly so.  Keep making comments like that and nobody will ever let
you off the hook for being a non-native speaker!

However, "surrogate" in English is typically used in situations that
are too complex to be covered by simply "substitution."  I've always
read "surrogate" as "alternative form of encoding", and "surrogate
code point" as "code point in that alternative form of encoding".
Where it's an alternative to code-point-is-scalar-value.  I think
probably the authors of the terminology just made the best of a bad
situation; I can't think of a better single word for this.

 > No. The specification puts no requirements on applications whatsoever.
 > So if you propose to use MUST NOT in the RFC 2119 sense, I strongly
 > disagree.

I do propose that.

But you're writing the PEP, so this battle will have to be deferred.
Eventually Python will have to take a stand on Unicode conformance,
but it's not urgent yet.

 > > 3.  In the discussion, the transition from the example of alternative
 > > use of 'python-escape' to discussion of the error handler
 > > interface extension is a bit abrupt.  I suggest rewriting as:
 > > 
 > > """The extension to the encode error handler interface proposed by
 > > this PEP is necessary to implement the 'utf8b' error handler,
 > > because there are required byte sequences which cannot be
 > > generated from replacement Unicode.  However, the encode error
 > > handler interface presently requires replacement Unicode to be
 > > provided in lieu of the non-encodable Unicode from the source
 > > string.  Then it promptly encodes that replacement Unicode.  In
 > > some error handlers, such as the 'utf8b' proposed here, it is also
 > > simpler and more efficient for the error handler to provide a
 > > pre-encoded replacement byte string, rather than forcing it to
 > > calculate Unicode from which the encoder would create the
 > > desired bytes."""
 > 
 > Unfortunately, I failed to understand where you want this text to
 > go. What paragraphs should I remove, or (if none), after which
 > paragraph should I insert this text?

Sorry!  I suggest substituting the paragraph above for the paragraph
which begins "The encode error handler interface presently requires..."
at line 129.

I think I forgot to do this before:  "I hereby dedicate all text
I suggest for inclusion in the PEP to the public domain."


