Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-08 Thread Walter Dörwald
Stephen J. Turnbull wrote: > Walter Dörwald writes: > > > "surrogatepass" (for the "don't complain about lone half surrogates" > > handler) and "surrogatereplace" sound OK to me. However the other > > "...replace" handlers are destructive (i.e. when such a "...replace" > > handler is used for

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Stephen J. Turnbull
M.-A. Lemburg writes: > I'd use "allowlonesurrogates" as name for the "surrogates" error > handler and "lonesurrogatereplace" for the "utf8b" one. +1

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Glenn Linderman
On approximately 5/7/2009 3:27 PM, came the following characters from the keyboard of MRAB: Terry Reedy wrote: Martin v. Löwis wrote: So I'm happy to make it "surrogatepass" and "surrogateescape" as These seem adequate. It is not what I would choose or suggest, but it is adequate, and it

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread M.-A. Lemburg
Martin v. Löwis wrote: >> The error handler for undoing this operation (ie. when converting >> a Unicode string to some other encoding) should probably use the >> same name based on symmetry and the fact that the escaping >> scheme is meant to be used for enabling round-trip safety. > > Could you

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread MRAB
Terry Reedy wrote: Martin v. Löwis wrote: Given your explanation of what the new 'surrogates' handler does (pass rather than reject erroneous surrogates), I think 'surrogates_pass' is fine. Thus, I consider that and 'surrogates_escape' the best proposal so far and suggest that you mak

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Terry Reedy
Martin v. Löwis wrote: Given your explanation of what the new 'surrogates' handler does (pass rather than reject erroneous surrogates), I think 'surrogates_pass' is fine. Thus, I consider that and 'surrogates_escape' the best proposal so far and suggest that you make this pair the curr

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Gregory P. Smith
On Thu, May 7, 2009 at 12:39 PM, "Martin v. Löwis" wrote: >> Given your explanation of what the new 'surrogates' handler does (pass >> rather than reject erroneous surrogates), I think 'surrogates_pass' is >> fine. Thus, I consider that and 'surrogates_escape' the best proposal >> so fa

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis
> Given your explanation of what the new 'surrogates' handler does (pass > rather than reject erroneous surrogates), I think 'surrogates_pass' is > fine. Thus, I consider that and 'surrogates_escape' the best proposal > so far and suggest that you make this pair the current status > quo

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Terry Reedy
Martin v. Löwis wrote: So are you proposing that I should rename the PEP 383 handler to "utf_8b_encoder_invalid_codepoints"? No, he's saying that your algorithm for choosing the PEP 383 handler should have come up with that name, rather than utf8b. But since PEP 383 applies to other codecs bes

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Stephen J. Turnbull
Walter Dörwald writes: > "surrogatepass" (for the "don't complain about lone half surrogates" > handler) and "surrogatereplace" sound OK to me. However the other > "...replace" handlers are destructive (i.e. when such a "...replace" > handler is used for encoding, decoding will not produce the

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis
> The error handler for undoing this operation (ie. when converting > a Unicode string to some other encoding) should probably use the > same name based on symmetry and the fact that the escaping > scheme is meant to be used for enabling round-trip safety. Could you please familiarize yourself wit

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis
> I haven't come up with anything I like better than errors="lenient" > for the old utf8 behavior handler; would errors="nonvalidating" be > correct? I think either is fairly unspecific. > For the utf8b error handler, I could see any of errors="roundtrip", > errors="roundtripreplace", errors="tos

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Martin v. Löwis
>> Well, there is a way to stack error handlers, although it's not pretty: >> [...] >> codecs.register_error("surrogates_then_replace", >> surrogates_then_replace) > > That mitigates my arguments significantly, although I'd rather see > something like errors=('surrogates', 're

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread MRAB
Walter Dörwald wrote: Michael Urman wrote: [...] Well, there is a way to stack error handlers, although it's not pretty: [...] codecs.register_error("surrogates_then_replace", surrogates_then_replace) That mitigates my arguments significantly, although I'd rather see some

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Walter Dörwald
Michael Urman wrote: > [...] >> Well, there is a way to stack error handlers, although it's not pretty: >> [...] >> codecs.register_error("surrogates_then_replace", >> surrogates_then_replace) > > That mitigates my arguments significantly, although I'd rather see > something
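The body of the stacked handler is elided ("[...]") in the snippets above, so the following is only a guess at the general shape, not the code that was actually posted. It assumes the names the handlers eventually shipped under ("surrogateescape", "replace") and falls back from the first to the second whenever the first refuses the error.

```python
import codecs


def surrogates_then_replace(exc):
    """Try the PEP 383 handler first; fall back to "replace" if it refuses.

    Assumption: this body is illustrative only. The thread names the handler
    "surrogates_then_replace" but does not show its implementation here.
    """
    try:
        return codecs.lookup_error("surrogateescape")(exc)
    except UnicodeError:
        # surrogateescape re-raises when the offending characters are not
        # escaped bytes (U+DC80..U+DCFF), so hand the error to "replace".
        return codecs.lookup_error("replace")(exc)


codecs.register_error("surrogates_then_replace", surrogates_then_replace)

# An escaped byte is restored; any other unencodable character becomes "?".
escaped = "a\udcff".encode("ascii", "surrogates_then_replace")
replaced = "a\u20ac".encode("ascii", "surrogates_then_replace")
```

Note that a codec reports a run of adjacent unencodable characters as a single error, so a mixed run (an escaped byte next to, say, a euro sign) is handed to the fallback as a whole.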

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Michael Urman
On Thu, May 7, 2009 at 01:16, "Martin v. Löwis" wrote: > I'm still at a loss what name to give it, though. I understand that > I have to rename both error handlers, but I'm uncertain what I should > rename them to. So proposals that rename only one of them aren't > that helpful. It would be helpfu

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Michael Urman
On Thu, May 7, 2009 at 00:43, "Martin v. Löwis" wrote: > Michael Urman wrote: >> On Wed, May 6, 2009 at 15:42, "Martin v. Löwis" wrote: >>> Despite there being also an error handler called "surrogates". >> >> Not that I have to be, but I'm not sold on the previous UTF-8 codec >> behavior becoming

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread MRAB
Martin v. Löwis wrote: Wouldn't renaming the existing "surrogates" handler be an incompatible change, and thus inappropriate? No - it's new in Python 3.1. So what do you think about Antoine's proposal? +1 Although it looks like it would be without the '-' for consistency with existing error

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Walter Dörwald
M.-A. Lemburg wrote: > Antoine Pitrou wrote: >> Martin v. Löwis writes: >>> py> b'\xed\xa0\x80'.decode("utf-8","surrogates") >>> '\ud800' >> The point is, "surrogates" does not mean anything intuitive for an /error >> handler/. You seem to be the only one who finds this name explicit

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread M.-A. Lemburg
Antoine Pitrou wrote: > Martin v. Löwis writes: >> py> b'\xed\xa0\x80'.decode("utf-8","surrogates") >> '\ud800' > > The point is, "surrogates" does not mean anything intuitive for an /error > handler/. You seem to be the only one who finds this name explicit enough, > perhaps because

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Glenn Linderman
On approximately 5/6/2009 11:16 PM, came the following characters from the keyboard of Martin v. Löwis: So are you proposing that I should rename the PEP 383 handler to "utf_8b_encoder_invalid_codepoints"? No, he's saying that your algorithm for choosing the PEP 383 handler should have come up

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> Wouldn't renaming the existing "surrogates" handler be an incompatible > change, and thus inappropriate? No - it's new in Python 3.1. So what do you think about Antoine's proposal? Regards, Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 10:53 PM, came the following characters from the keyboard of Martin v. Löwis: The error handler designed with utf-8 in mind has no name in the encode direction and is called "utf_8b_decoder_invalid_bytes" in the decode direction. By your reasoning, *that* should be its

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
>> So are you proposing that I should rename the PEP 383 handler >> to "utf_8b_encoder_invalid_codepoints"? > > > No, he's saying that your algorithm for choosing the PEP 383 handler > should have come up with that name, rather than utf8b. But since PEP > 383 applies to other codecs besides UTF-

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> By the way, what are the ASCII characters that are not supported by > Shift-JIS? > Not many I suppose? (if I read the Wikipedia entry correctly, it's only the > backslash and the tilde). The problem with this encoding is that bytes below 128 appear as second bytes of a two-byte encoding: py>
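The interactive session in the message above is cut off. As a hedged illustration of the point about second bytes, the sketch below uses U+8868, a character chosen here and not taken from the thread, whose Shift JIS encoding ends in the ASCII backslash byte. It then decodes those bytes as if the file system encoding were UTF-8, using the handler under the name it finally shipped with ("surrogateescape").

```python
# U+8868 is an example picked for this sketch, not one from the thread.
sjis_name = "\u8868".encode("shift_jis")

# The trail byte is 0x5C, i.e. the ASCII backslash.
second_byte_is_ascii = sjis_name[1] < 128

# Under a UTF-8 file system encoding the lead byte 0x95 is undecodable and is
# escaped to U+DC95, while the backslash decodes as an ordinary character.
decoded = sjis_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = decoded.encode("utf-8", "surrogateescape")
```

This covers only the UTF-8 case; it says nothing about how the EUC family of codecs behaves on the same input.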

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> The error handler designed with utf-8 in mind has no name in the encode > direction and is called "utf_8b_decoder_invalid_bytes" in the decode > direction. By your reasoning, *that* should be its name in Python. The > encoding error handler would then be named analogously > "utf_8b_encoder_inva

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
Michael Urman wrote: > On Wed, May 6, 2009 at 15:42, "Martin v. Löwis" wrote: >> Despite there being also an error handler called "surrogates". > > Not that I have to be, but I'm not sold on the previous UTF-8 codec > behavior becoming an error handler of the name "surrogates" for two > reasons (

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 6:06 PM, came the following characters from the keyboard of M.-A. Lemburg: Martin, please stop being silly and just change the name. Yes, please. If indeed Marc-Andre invented the codec business as he claims, he would be an appropriate person to give a fiat name t

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy
Martin v. Löwis wrote: Are you serious? Are you? ;-? You are the one naming a codec-agnostic error handler (if I understand correctly, and correct me if I do not) after a particular codec, and denying that that could cause confusion. See other message. I can only repeat what I said before: I

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > > Now, with Python's file system encoding == UTF-8 or any packed EUC, > > and more than a handful of Shift JIS or Big5 characters in file names, > > one is *almost certain* to encounter ASCII as the second byte of a > > multibyte sequence. PEP 383 can't handle this

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread M.-A. Lemburg
Martin v. Löwis wrote: The name "utf8b" suggested in the PEP is not in line with the codec design >>> Where is that design documented, and how exactly violates the name >>> the design (chapter and verse, please). >> Martin, I designed the whole Python codec machinery > > Not true. PEP 29

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Michael Urman
On Wed, May 6, 2009 at 15:42, "Martin v. Löwis" wrote: > Despite there being also an error handler called "surrogates". Not that I have to be, but I'm not sold on the previous UTF-8 codec behavior becoming an error handler of the name "surrogates" for two reasons (I do respect the obvious PBP arg

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Martin v. Löwis writes: > py> b'\xed\xa0\x80'.decode("utf-8","surrogates") > '\ud800' The point is, "surrogates" does not mean anything intuitive for an /error handler/. You seem to be the only one who finds this name explicit enough, perhaps because you chose it. Most other handlers
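For readers trying the quoted session today: the handler called "surrogates" in this thread was released under the name "surrogatepass". A minimal sketch of the same behaviour under that name, contrasted with the default strict handling:

```python
data = b"\xed\xa0\x80"  # UTF-8-style byte sequence for the lone surrogate U+D800

# The default ("strict") handling refuses the sequence.
try:
    data.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "rejected"

# "surrogatepass" lets the lone surrogate through in both directions.
passed = data.decode("utf-8", "surrogatepass")
round_trip = passed.encode("utf-8", "surrogatepass")
```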

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> I qualify with a). I believe I understand c) but, as explained in my > other post, I do not think your reason applies. In fact, I think > concern for naming rights might suggest that you *not* reuse the name > for something different. I would have to learn more about the existing > 'surrogates'

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread MRAB
Antoine Pitrou wrote: Martin v. Löwis writes: Despite there being also an error handler called "surrogates". People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what t

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
>> Are you serious? > > Are you? ;-? You are the one naming a codec-agnostic error handler (if > I understand correctly, and correct me if I do not) after a particular > codec, and denying that that could cause confusion. See other message. I can only repeat what I said before: I call it utf8b

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy
Martin v. Löwis wrote: Antoine Pitrou wrote: Martin v. Löwis writes: Despite there being also an error handler called "surrogates". People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds q

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy
Martin v. Löwis wrote: Because utf8b (or, perhaps "UTF-8b") is the official name for this algorithm: http://hyperreal.org/~est/utf-8b/ Thank you for the link. It starts: "This directory contains a C implementation of a UTF-8b codec. A Python codec based on it is provided as well." 'RTF-8b' c

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Paul Moore
2009/5/6 Antoine Pitrou: > Martin v. Löwis writes: >> >> Despite there being also an error handler called "surrogates". > > People, perhaps we could end all the bikeshedding and call one of those > handlers > "surrogates-pass" and the other "surrogates-escape", which sounds quite >

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy
Martin v. Löwis wrote: +1 for "surrogate" as the name for the error handler. +1 from me also Despite there being also an error handler called "surrogates". Given that additional information which MAL apparently omitted, I would revise. Are you serious? Are you? ;-? You are the one

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
Antoine Pitrou wrote: > Martin v. Löwis writes: >> Despite there being also an error handler called "surrogates". > > People, perhaps we could end all the bikeshedding and call one of those > handlers > "surrogates-pass" and the other "surrogates-escape", which sounds quite > faith

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> But first, it should be stopped by any of several > standard precautions. For example, applying os.path.realpath (come to > think of it, PEP 383 should say something about realpath, shouldn't > it?) Why do you think so? I think the existing documentation of realpath is correct and complete. >

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Martin v. Löwis writes: > > Despite there being also an error handler called "surrogates". People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what they actually /do/? R

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> Is it only usable with utf8 as an encoding? No, it applies to any codec which potentially cannot decode all bytes >127. Regards, Martin
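A small sketch of that answer, using the name the handler finally shipped with ("surrogateescape"): the same undecodable byte is escaped identically whether the codec is ASCII or UTF-8, and encoding with the same handler restores it. The sample bytes are an assumption chosen for illustration.

```python
raw = b"caf\xe9"  # Latin-1 bytes; 0xE9 is undecodable as ASCII and as UTF-8 here

# The handler is codec-agnostic: each undecodable byte 0xNN becomes U+DCNN.
as_ascii = raw.decode("ascii", "surrogateescape")
as_utf8 = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler yields the original bytes again.
back_from_ascii = as_ascii.encode("ascii", "surrogateescape")
back_from_utf8 = as_utf8.encode("utf-8", "surrogateescape")
```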

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
Terry Reedy wrote: > Glenn Linderman wrote: >> On approximately 5/6/2009 3:08 AM, came the following characters from >> the keyboard of MRAB: >>> M.-A. Lemburg wrote: Martin v. Löwis wrote: >> >>> Judging by the existing names, I think that 'surrogate' would be >>> reasonable. It already conta

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> Judging by the existing names, I think that 'surrogate' would be > reasonable MAL's list of existing names is incomplete. "surrogates" is already an existing name, also, and it means something different (similar, but different). Regards, Martin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> I'm sorry for the lack of clarity of my posts, but somehow you're > completely missing the point. The point is precisely that Python > *won't* use Shift JIS as the file system encoding (if it did there > would be no problem with reading Shift JIS), but the people who > created the media *did*. >

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
>>> The name "utf8b" suggested in the PEP is not in line with the codec >>> design >> Where is that design documented, and how exactly violates the name >> the design (chapter and verse, please). > > Martin, I designed the whole Python codec machinery Not true. PEP 293 was written and designed by

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 12:18 PM, came the following characters from the keyboard of Zooko Wilcox-O'Hearn: On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote: Zooko Wilcox-O'Hearn writes: I'm not thinking of API compatibility as much as data compatibility -- someone used Python

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Zooko Wilcox-O'Hearn
On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote: Zooko Wilcox-O'Hearn writes: I'm not thinking of API compatibility as much as data compatibility -- someone used Python 3.1 to write down some filenames, and now a few years later they are trying to use the latest and greates

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy
Glenn Linderman wrote: On approximately 5/6/2009 3:08 AM, came the following characters from the keyboard of MRAB: M.-A. Lemburg wrote: Martin v. Löwis wrote: Judging by the existing names, I think that 'surrogate' would be reasonable. It already contains the meaning of substitute, it's not

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 12:53 AM, came the following characters from the keyboard of Martin v. Löwis: Sorry! I suggest substituting the paragraph above for the paragraph which begins "The encode error handler interface presently requires..." at line 129. Ah, ok. This was Glenn Linderman's te

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 3:08 AM, came the following characters from the keyboard of MRAB: M.-A. Lemburg wrote: Martin v. Löwis wrote: Judging by the existing names, I think that 'surrogate' would be reasonable. It already contains the meaning of substitute, it's not too long, and the codes

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 6:33 AM, came the following characters from the keyboard of Stephen J. Turnbull: "Martin v. Löwis" writes: > In any case, Python 3.1b1 may get released today, so it's way too late > for new features in the PEP. They can wait for Python 3.2. You have convinced me that

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Zooko Wilcox-O'Hearn writes: > > I'm not thinking of API compatibility as much as > data compatibility -- someone used Python 3.1 to write down some > filenames, and now a few years later they are trying to use the > latest and greatest Python release to read those filenames...

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread James Y Knight
On May 6, 2009, at 5:39 AM, Stephen J. Turnbull wrote: Now, with Python's file system encoding == UTF-8 or any packed EUC, and more than a handful of Shift JIS or Big5 characters in file names, one is *almost certain* to encounter ASCII as the second byte of a multibyte sequence. PEP 383 can't h

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Zooko Wilcox-O'Hearn
On May 6, 2009, at 7:33 AM, Stephen J. Turnbull wrote: You have convinced me that the PEP should wait as well. In its current form it is incomplete and dangerous. +1 on delaying PEP 383 I think PEP 383 is a good idea in principle, but I'm still struggling to understand it myself, and it se

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread R. David Murray
On Wed, 6 May 2009 at 13:40, Antoine Pitrou wrote: Stephen J. Turnbull writes: Nothing is lost compared to 'strict', true, but under the PEP as it is a large fraction of Shift JIS and Big5 filenames cannot be read under ASCII-compatible file system encodings using 'utf8b'. You sh

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Stephen J. Turnbull writes: > > Nothing is lost compared to 'strict', true, but under the PEP as it is > a large fraction of Shift JIS and Big5 filenames cannot be read under > ASCII-compatible file system encodings using 'utf8b'. You should really be more specific. I'm not sure abou

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > > Yeah, yeah, this is the same old same old from PEP 3131. Anything > > that handles the various attacks based on ASCII-alike characters > > should at least rule out invalid Unicode, too! > > > > And where is this U+DC2F supposed to be coming from, anyway? The

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull
Lino Mastrodomenico writes: > It's a known problem with Shift-JIS and was fixed in UTF-8. It was fixed in EUC before Shift-JIS was invented by Microsoft or Big5 was invented by the Taiwanese clone makers. Guido's not the only language designer with a time machine

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Lennart Regebro
On Wed, May 6, 2009 at 09:31, "Martin v. Löwis" wrote: > They *are* separate namespaces; that's guaranteed by the implementation. Yes. But utf8b *sounds like* an encoding. When it isn't. I sure thought it was when it was first mentioned. I agree that it would be better to find another name. 'utf

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Lino Mastrodomenico
2009/5/6 Antoine Pitrou: > By the way, what are the ASCII characters that are not supported by > Shift-JIS? > Not many I suppose? (if I read the Wikipedia entry correctly, it's only the > backslash and the tilde). The biggest problem with Shift-JIS is that a perfectly valid unicode character ab

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
MRAB writes: > > Judging by the existing names, I think that 'surrogate' would be > reasonable. It already contains the meaning of substitute, Only if you are a native English-speaker I suppose... For me it's just a technical term denoting a certain class of unicode code poi

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread MRAB
M.-A. Lemburg wrote: Martin v. Löwis wrote: The name "utf8b" suggested in the PEP is not in line with the codec design Where is that design documented, and how exactly violates the name the design (chapter and verse, please). Martin, I designed the whole Python codec machinery, so even if thi

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread M.-A. Lemburg
Martin v. Löwis wrote: >> The name "utf8b" suggested in the PEP is not in line with the codec >> design > > Where is that design documented, and how exactly violates the name > the design (chapter and verse, please). Martin, I designed the whole Python codec machinery, so even if this is not expl

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > I fail to see how this could ever matter. If, by "media", you mean > things like removable disks, and the file name encoding used on them, > it's fairly irrelevant for the PEP, since Python won't start using > Shift JIS as its file system encoding just because that'

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Martin v. Löwis writes: > > > I don't personally care (I already was aware of UTF-8B), but there are > > plenty of others who do. > > I think it is a fairly bad name, because it is easy to confuse it with > the "surrogates" error handler (unless you suggest to rename that also). I

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> Yeah, yeah, this is the same old same old from PEP 3131. Anything > that handles the various attacks based on ASCII-alike characters > should at least rule out invalid Unicode, too! > > And where is this U+DC2F supposed to be coming from, anyway? The > user's *local* environment or the user's

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> > > Second, I suggest "surrogate-replace" as the name of the error handler > > > rather than "utf8b". > > > > I think this is bike-shedding. > > I don't personally care (I already was aware of UTF-8B), but there are > plenty of others who do. I think it is a fairly bad name, because it is

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
Stephen J. Turnbull wrote: > "Martin v. Löwis" writes: > > > It occurs to me that the PEP maybe should say that it is an error > > > to have your POSIX locale set to UTF-16 or something like that. > > > > No. It is *impossible* to have UTF-16 as the locale character set, > > not an er

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> The name "utf8b" suggested in the PEP is not in line with the codec > design Where is that design documented, and how exactly violates the name the design (chapter and verse, please). > Error handlers and codecs are two different things, so the namespaces > need to be clearly separate. They *a

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > Done: the Python-Version header already clarifies that point. Ah, OK. I wish my day job required reading more PEPs so I'd be more familiar with these formalities. :-) > > Second, I suggest "surrogate-replace" as the name of the error handler > > rather than "utf8b

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
Lino Mastrodomenico writes: > 2009/5/5 Stephen J. Turnbull: > > Third, it is not clear to me why non-decodable ASCII should be an > > error. > > The PEP originally allowed the conversion to U+DCxx of bytes below 128 > that cannot be decoded by the encoding used, but this creates > potentia

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > > It occurs to me that the PEP maybe should say that it is an error > > to have your POSIX locale set to UTF-16 or something like that. > > No. It is *impossible* to have UTF-16 as the locale character set, > not an error. Your statement is like saying "it

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread M.-A. Lemburg
Martin v. Löwis wrote: >> I have three substantive comments. First, although consequences for >> Python 3 byte interfaces (ie, "none") are explicitly stated, as far as >> I can see this PEP could apply to Python 2 as well. I don't think >> it's intended that way. Either way, I think you should c

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis
> It occurs to me that the PEP maybe should say that it is an error > to have your POSIX locale set to UTF-16 or something like that. No. It is *impossible* to have UTF-16 as the locale character set, not an error. Your statement is like saying "it is an error to breathe in the vacuum". I

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis
> I have three substantive comments. First, although consequences for > Python 3 byte interfaces (ie, "none") are explicitly stated, as far as > I can see this PEP could apply to Python 2 as well. I don't think > it's intended that way. Either way, I think you should clarify that > point. Done:

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Martin v. Löwis
> > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - > > > it's an algorithm based on 16-bit or 32-bit code points. > > I don't understand this phrasing. The algorithm is only applicable to > ASCII-compatible octet streams. It results in code points by a simple > disp

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Lino Mastrodomenico
2009/5/5 Stephen J. Turnbull: > Third, it is not clear to me why non-decodable ASCII should be an > error. The PEP originally allowed the conversion to U+DCxx of bytes below 128 that cannot be decoded by the encoding used, but this creates potential security problems. See:
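The restriction described here can be observed in the handler as it finally shipped (under the name "surrogateescape"): only U+DC80..U+DCFF map back to bytes, so a code point such as U+DC2F, which would otherwise stand for the ASCII byte 0x2F ("/"), is refused on encoding. A minimal sketch:

```python
# U+DC2F would correspond to byte 0x2F ("/"); the handler refuses it.
try:
    "\udc2f".encode("utf-8", "surrogateescape")
    low_outcome = "encoded"
except UnicodeEncodeError:
    low_outcome = "rejected"

# Code points for bytes >= 0x80 are converted back to the raw byte.
high = "\udcaf".encode("utf-8", "surrogateescape")
```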

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
MRAB writes: > [snip] > It might be slightly OT, but sometimes strict UTF-8 encoding is violated > by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as > a terminator. I think I read that Microsoft sometimes does this. Nice hack! as long as you don't let it escape. But if
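As a hedged illustration of how this two-byte form interacts with the PEP 383 handler (under its final name, "surrogateescape"): Python's UTF-8 decoder rejects the overlong sequence, and the handler escapes each of the two bytes separately and does not turn them into U+0000.

```python
overlong_nul = b"\xc0\x80"  # non-standard two-byte form of U+0000

# Strict UTF-8 decoding rejects the overlong form.
try:
    overlong_nul.decode("utf-8")
    strict_outcome = "decoded"
except UnicodeDecodeError:
    strict_outcome = "rejected"

# With the handler, each invalid byte is escaped on its own and round-trips.
escaped = overlong_nul.decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
```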

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread MRAB
Stephen J. Turnbull wrote: MRAB writes: > > I don't think "people shouldn't be using non-ASCII-compatible > > encodings for locale encodings" is a sufficient rationale for a hard > > error here. I mean, of course they *should* be using UTF-8. Maybe > > Python 3.1 should just go ahead and e

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
MRAB writes: > > I don't think "people shouldn't be using non-ASCII-compatible > > encodings for locale encodings" is a sufficient rationale for a hard > > error here. I mean, of course they *should* be using UTF-8. Maybe > > Python 3.1 should just go ahead and error on any other encoding on

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
Zooko O'Whielacronx writes: > How would an application make sure that they were producing only > valid unicode? That's very difficult. There are a couple of sources that I can think of, in Python: C modules, chr(), \u literals, and now codecs with the 'utf8b'. There may be others. You'd need

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread MRAB
Stephen J. Turnbull wrote: "Martin v. Löwis" writes: > I've updated the PEP accordingly. I have three substantive comments. First, although consequences for Python 3 byte interfaces (ie, "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't thin

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Zooko O'Whielacronx
On Tue, May 5, 2009 at 8:57 AM, Stephen J. Turnbull wrote: > > 2.  The specification should state, and the discussion emphasize, that >    strings which were produced by surrogate replacement *must not* be >    used in data interchange with systems that do not specifically >    accept such strings

[Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > I've updated the PEP accordingly. I have three substantive comments. First, although consequences for Python 3 byte interfaces (ie, "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't think it's intended that way. Ei

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Stephen J. Turnbull
M.-A. Lemburg writes: > On 2009-05-03 19:39, Martin v. Löwis wrote: > >> If the error handler is supposed to be used for codecs other than utf-8, > >> perhaps it should renamed something more generic, e.g. "surrogate-escape"? > > > > Perhaps. However, utf-8b doesn't really have to do anything

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread Terry Reedy
M.-A. Lemburg wrote: On 2009-05-03 19:39, Martin v. Löwis wrote: If the error handler is supposed to be used for codecs other than utf-8, perhaps it should renamed something more generic, e.g. "surrogate-escape"? Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an a

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-05 Thread M.-A. Lemburg
On 2009-05-03 19:39, Martin v. Löwis wrote: >> If the error handler is supposed to be used for codecs other than utf-8, >> perhaps it should renamed something more generic, e.g. "surrogate-escape"? > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - > it's an algorithm bas

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Gregory P. Smith
On Sun, May 3, 2009 at 1:27 PM, "Martin v. Löwis" wrote: > > > If the error handler is supposed to be used for codecs other than > > utf-8, > > > perhaps it should renamed something more generic, e.g. > > "surrogate-escape"? > > > > Perhaps. However, utf-8b doesn't really have

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Martin v. Löwis
> > If the error handler is supposed to be used for codecs other than > utf-8, > > perhaps it should renamed something more generic, e.g. > "surrogate-escape"? > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - > it's an algorithm based on 16-bit o

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Gregory P. Smith
On Sun, May 3, 2009 at 10:39 AM, "Martin v. Löwis" wrote: > > If the error handler is supposed to be used for codecs other than utf-8, > > perhaps it should renamed something more generic, e.g. > "surrogate-escape"? > > Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - > it'

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Martin v. Löwis
> If the error handler is supposed to be used for codecs other than utf-8, > perhaps it should renamed something more generic, e.g. "surrogate-escape"? Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points. > Also, if utf8

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Martin v. Löwis
> That's even nicer. One minor detail though, in the sentence: > > "non-decodable bytes >128 will be represented as lone half surrogate" > > ">" should be ">=". Thanks, fixed. Martin
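The ">=" correction matters at the boundary byte 0x80. A short sketch, assuming the handler under its final CPython name `surrogateescape` (the function name `decode_lossless` is hypothetical):

```python
def decode_lossless(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes, turning undecodable bytes >= 0x80 into lone surrogates."""
    return raw.decode(encoding, errors="surrogateescape")


# 0x7F is plain ASCII (DEL) and decodes normally.
assert decode_lossless(b"\x7f") == "\x7f"

# 0x80 itself, not only bytes above it, is escaped: U+DC80.
assert decode_lossless(b"\x80") == "\udc80"

# A stray Latin-1 byte inside otherwise valid UTF-8 survives a round trip.
raw = b"caf\xe9.txt"
text = decode_lossless(raw)
assert text == "caf\udce9.txt"
assert text.encode("utf-8", errors="surrogateescape") == raw
```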

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Michael Urman
On Sun, May 3, 2009 at 08:43, Antoine Pitrou wrote: > Also, if utf8-b is not provided as a codec, will there be an easy way for user > code to use the same encoding as the IO layer does? (e.g. > os.fsdecode/os.fsencode)? I like the idea of fsencode/fsdecode functions, but we need to be careful de
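Functions with these names were later added to the standard library (`os.fsencode` and `os.fsdecode`, available since Python 3.2). The sketch below assumes a POSIX system, where they use the filesystem encoding together with the `surrogateescape` handler. On Windows the error handling differs, so the round trip for arbitrary bytes is only claimed for POSIX here. `roundtrip_filename` is a hypothetical helper.

```python
import os


def roundtrip_filename(raw: bytes) -> bytes:
    """Decode a byte filename the way the OS layer does, then encode it back."""
    return os.fsencode(os.fsdecode(raw))


# Pure-ASCII names are unaffected.
assert roundtrip_filename(b"plain.txt") == b"plain.txt"

if os.name == "posix":
    # Assumption: POSIX, where undecodable bytes are escaped and restored.
    odd_name = b"caf\xe9.txt"
    assert roundtrip_filename(odd_name) == odd_name
```

The decoded form of `odd_name` depends on the locale (for example an escaped surrogate under UTF-8, or "é" under Latin-1), which is why the example checks only the round trip and not the intermediate string.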

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Antoine Pitrou
Martin v. Löwis writes: > > Glenn Linderman suggested that the name "python-escape" is not very > descriptive, so I've changed the name to "utf8b". If the error handler is supposed to be used for codecs other than utf-8, perhaps it should renamed something more generic, e.g. "surrog

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Lino Mastrodomenico
2009/5/3 "Martin v. Löwis" : > With issue 3672 resolved, it is now unnecessary to introduce > an utf-8b codec, since the utf-8 codec will properly report errors > for all byte sequences invalid in UTF-8, including lone surrogates. > Therefore, utf-8b can be implemented solely through the error hand

[Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-03 Thread Martin v. Löwis
With issue 3672 resolved, it is now unnecessary to introduce an utf-8b codec, since the utf-8 codec will properly report errors for all byte sequences invalid in UTF-8, including lone surrogates. Therefore, utf-8b can be implemented solely through the error handler. Glenn Linderman suggested that
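A rough sketch of what "implemented solely through the error handler" means in practice, assuming current CPython, where the handler is registered as `surrogateescape` rather than `utf8b`. The strict UTF-8 codec rejects lone surrogates in both directions, and the same handler can be passed to other ASCII-compatible codecs without any dedicated codec. The helper `raises` is hypothetical.

```python
def raises(exc_type, func, *args, **kwargs) -> bool:
    """Return True if calling func raises exc_type."""
    try:
        func(*args, **kwargs)
    except exc_type:
        return True
    return False


# Strict UTF-8 refuses to encode a lone surrogate...
assert raises(UnicodeEncodeError, "\udc80".encode, "utf-8")

# ...and refuses to decode the byte sequence that would represent one.
surrogate_bytes = b"\xed\xb2\x80"
assert raises(UnicodeDecodeError, surrogate_bytes.decode, "utf-8")

# The handler is independent of the codec: it works with ASCII as well.
assert b"\xff".decode("ascii", errors="surrogateescape") == "\udcff"
assert "\udcff".encode("ascii", errors="surrogateescape") == b"\xff"

# Bytes that strict UTF-8 rejects still round-trip through the handler.
text = surrogate_bytes.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == surrogate_bytes
```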