Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Olemis Lang
Probably one part of this is OT , but I think it could complement the discussion ;o) On Mon, Jan 11, 2010 at 3:44 PM, M.-A. Lemburg wrote: > Olemis Lang wrote: >>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner >>> wrote: Hi, Builtin open() function is unable to open an UTF-16/32

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread M.-A. Lemburg
Olemis Lang wrote: >> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner >> wrote: >>> Hi, >>> >>> Builtin open() function is unable to open an UTF-16/32 file starting with a >>> BOM if the encoding is not specified (raise an unicode error). For an UTF-8 >>> file starting with a BOM, read()/readline()

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Olemis Lang
> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner > wrote: >> Hi, >> >> Builtin open() function is unable to open an UTF-16/32 file starting with a >> BOM if the encoding is not specified (raise an unicode error). For an UTF-8 >> file starting with a BOM, read()/readline() returns also the BOM wher

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Martin v. Löwis
> I must say that I find this whole thing pretty obvious. 'BOM' is not > an encoding. That I certainly agree with. > That covers all usecases, is easy and obvious. Either open(file=foo, > encoding=None) or open(file, encoding=encoding_from_bom(file)) > > I can't see that open(file, encoding='BOM

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread MRAB
Lennart Regebro wrote: On Mon, Jan 11, 2010 at 11:37, Walter Dörwald wrote: UTF-8 might be a good choice No, fallback if there is no BOM should be the local settings, just as fallback is today if you don't specify a codec. I mean, what if you want to look for a BOM but fall back to something

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 18:16, "Martin v. Löwis" wrote: >> But an autodetect feature is not a codec. Sure it should be reusable, >> but making it a codec seems to be  a weird hack to me. > > Well, the existing UTF-16 codec also is an autodetect feature (to > detect the endianness), and I don't con

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Martin v. Löwis
> But an autodetect feature is not a codec. Sure it should be reusable, > but making it a codec seems to be a weird hack to me. Well, the existing UTF-16 codec also is an autodetect feature (to detect the endianness), and I don't consider it a weird hack. Regards, Martin
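
The endianness detection Martin refers to can be seen with the standard codecs alone. The snippet below is only an illustrative sketch, not code from the thread: the sample text and variable names are made up, while the "utf-16" codec and the codecs.BOM_UTF16_* constants are standard library.

```python
import codecs

text = "abc"

# Build the same text in both byte orders, each prefixed with its BOM.
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic "utf-16" codec reads the BOM, picks the byte order,
# and does not return the BOM as part of the decoded text.
decoded_le = little_endian.decode("utf-16")
decoded_be = big_endian.decode("utf-16")

# When encoding, the generic codec writes a BOM in the platform's byte order.
written = text.encode("utf-16")
starts_with_bom = written[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Both inputs decode to the same string, which is the sense in which the codec already "autodetects" something from the data.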

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 14:21, Walter Dörwald wrote: > I think we already had this discussion two years ago in the context of > XML decoding ;): Yup. And Martin's answer then is my answer now: "> So the code is good, if it is inside an XML parser, and it's bad if it > is inside a codec? Exactly

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 11.01.10 13:45, Lennart Regebro wrote: > On Mon, Jan 11, 2010 at 13:29, Walter Dörwald wrote: >> However if this autodetection feature is useful in other cases (no >> matter how it's activated), it should be a codec, because as part of the >> open() function it isn't reusable. > > But an auto

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 13:29, Walter Dörwald wrote: > However if this autodetection feature is useful in other cases (no > matter how it's activated), it should be a codec, because as part of the > open() function it isn't reusable. But an autodetect feature is not a codec. Sure it should be reu

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Antoine Pitrou
> However if this autodetection feature is useful in other cases (no > matter how it's activated), it should be a codec, because as part of the > open() function it isn't reusable. It is reusable as part of io.TextIOWrapper, though. Regards Antoine. ___

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 10.01.10 00:40, "Martin v. Löwis" wrote: >>> How does the requirement that it be implemented as a codec miss the >>> point? >> >> If we want it to be the default, it must be able to fallback on the current >> locale-based algorithm if no BOM is found. I don't think it would be easy >> for a >>

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 12:12, Lennart Regebro wrote: > BOM is not a locale, and should not be a locale. Having a locale > called BOM is wrong per se. D'oh! I mean codec here obviously. -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64 _

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Lennart Regebro
On Mon, Jan 11, 2010 at 11:37, Walter Dörwald wrote: > UTF-8 might be a good choice No, fallback if there is no BOM should be the local settings, just as fallback is today if you don't specify a codec. I mean, what if you want to look for a BOM but fall back to something else? How far will we go

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 09.01.10 14:38, Victor Stinner wrote: > Le samedi 09 janvier 2010 12:18:33, Walter Dörwald a écrit : >>> Good idea, I chose open(filename, encoding="BOM"). >> >> On the surface this looks like there's an encoding named "BOM", but >> looking at your patch I found that the check is still done i

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-10 Thread Lennart Regebro
On Sun, Jan 10, 2010 at 12:10, Henning von Bargen wrote: > If Python should support BOM when reading text files, > it should also be able to *write* such files. That's what I thought too. Turns out the UTF-16 does write such a mark. You also have the constants in the codecs module, so you can wri

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-10 Thread Henning von Bargen
If Python should support BOM when reading text files, it should also be able to *write* such files. An encoding="BOM" argument wouldn't help here, because it does not specify which encoding to actually use: UTF-8, UTF-16-LE or what? That would be a point against encoding="BOM" and pro an additio

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Martin v. Löwis
>> How does the requirement that it be implemented as a codec miss the >> point? > > If we want it to be the default, it must be able to fallback on the current > locale-based algorithm if no BOM is found. I don't think it would be easy for > a > codec to do that. Yes - however, Victor currently

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Michael Foord
On 09/01/2010 22:14, Lennart Regebro wrote: On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou wrote: If we want it to be the default, it must be able to fallback on the current locale-based algorithm if no BOM is found. I don't think it would be easy for a codec to do that. Right. It seem

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Lennart Regebro
On Sat, Jan 9, 2010 at 21:28, Antoine Pitrou wrote: > If we want it to be the default, it must be able to fallback on the current > locale-based algorithm if no BOM is found. I don't think it would be easy for > a > codec to do that. Right. It seems like encoding=None is the right way to go ther

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Antoine Pitrou
Martin v. Löwis v.loewis.de> writes: > > > Sorry but this is missing the point. The point here is to improve the open() > > function. I'm sure people who know about encodings are able to install the > > chardet library or even whip up their own BOM detection routine... > > How does the requireme

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Martin v. Löwis
Antoine Pitrou wrote: > Walter Dörwald livinglogic.de> writes: >> On the surface this looks like there's an encoding named "BOM", but >> looking at your patch I found that the check is still done in >> TextIOWrapper. IMHO the best approach would to the implement a *real* >> codec named "BOM" (o

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Antoine Pitrou
Walter Dörwald livinglogic.de> writes: > > On the surface this looks like there's an encoding named "BOM", but > looking at your patch I found that the check is still done in > TextIOWrapper. IMHO the best approach would to the implement a *real* > codec named "BOM" (or "sniff"). This doesn't

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Victor Stinner
Le samedi 09 janvier 2010 12:18:33, Walter Dörwald a écrit : > > Good idea, I chose open(filename, encoding="BOM"). > > On the surface this looks like there's an encoding named "BOM", but > looking at your patch I found that the check is still done in > TextIOWrapper. IMHO the best approach woul

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Walter Dörwald
Victor Stinner wrote: Le vendredi 08 janvier 2010 10:10:23, Martin v. Löwis a écrit : Builtin open() function is unable to open an UTF-16/32 file starting with a BOM if the encoding is not specified (raise an unicode error). For an UTF-8 file starting with a BOM, read()/readline() returns also t

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Nick Coghlan
MRAB wrote: > Maybe there should also be a way of determining what encoding it decided > it was, so that you can then write a new file in that same encoding. I thought of that question as well - the f.encoding attribute on the opened file should be sufficient. Cheers, Nick. -- Nick Coghlan |
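
As a small sketch of Nick's point, using only today's open() (the proposed BOM detection is not assumed here), the encoding attribute of a text file object can be passed straight to a second open() call. The temporary paths and variable names are made up for the example.

```python
import os
import tempfile

# Create a scratch file encoded as UTF-16 (the codec writes a BOM).
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-16") as f:
    f.write("hello")

# Read it back and remember which encoding the file object reports.
with open(path, encoding="utf-16") as f:
    reported = f.encoding
    text = f.read()

# Reuse the reported encoding to write a new file.
copy_path = path + ".copy"
with open(copy_path, "w", encoding=reported) as out:
    out.write(text)

with open(path, "rb") as a, open(copy_path, "rb") as b:
    same_bytes = a.read() == b.read()

os.remove(path)
os.remove(copy_path)
```

Whether an autodetecting mode would report the detected encoding through this same attribute is exactly what the thread is discussing; the sketch only shows the existing attribute.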

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Georg Brandl
Am 08.01.2010 22:14, schrieb Tres Seaver: >> FWIW, I'm personally in favor of using the UTF-8 signature. If people >> consider them crazy talk, that may be because UTF-8 can't possibly have >> a byte order - hence I call it a signature, not the BOM. As a signature, >> I don't consider it crazy at

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 22:40:47, Eric Smith a écrit : > >> Shouldn't this encoding guessing be a separate function that you call > >> on either a file or a seekable stream ? > >> > >> After all, detecting encodings is just as useful to have for non-file > >> streams. > > > > Other stream sourc

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver
Eric Smith wrote: >>> Shouldn't this encoding guessing be a separate function that you call >>> on either a file or a seekable stream ? >>> >>> After all, detecting encodings is just as useful to have for non-file >>> streams. >> Other stream sources t

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread M.-A. Lemburg
Tres Seaver wrote: > M.-A. Lemburg wrote: > >> Shouldn't this encoding guessing be a separate function that you call >> on either a file or a seekable stream ? > >> After all, detecting encodings is just as useful to have for non-file >> streams. > > Other stream sources typically have out-of-ba

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread James Y Knight
On Jan 8, 2010, at 4:14 PM, Tres Seaver wrote: I understood this proposal as a general processing guideline, not something the io library should do (but, say, a text editor). FWIW, I'm personally in favor of using the UTF-8 signature. If people consider them crazy talk, that may be because UTF-8

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Eric Smith
>> Shouldn't this encoding guessing be a separate function that you call >> on either a file or a seekable stream ? >> >> After all, detecting encodings is just as useful to have for non-file >> streams. > > Other stream sources typically have out-of-band ways to signal the > encoding: only when r

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver
Martin v. Löwis wrote: >>> It *is* crazy, but unfortunately rather common. Wikipedia has a good >>> description of the issues: >>> . Basically, some >>> Windows text APIs will emit a UTF-8 "BOM" in

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver
M.-A. Lemburg wrote: > Shouldn't this encoding guessing be a separate function that you call > on either a file or a seekable stream ? > > After all, detecting encodings is just as useful to have for non-file > streams. Other stream sources typicall

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Tres Seaver
Guido van Rossum wrote: > On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver wrote: >> The BOM should not be seekeable if the file is opened with the proposed >> "guess encoding from BOM" mode: it isn't properly part of the stream at >> all in that case. >

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread MRAB
Victor Stinner wrote: Le vendredi 08 janvier 2010 05:21:04, Guido van Rossum a écrit : (...) (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?) I wrote a new version of my patch (version 3):

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Guido van Rossum python.org> writes: > > On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver palladion.com> wrote: > > The BOM should not be seekeable if the file is opened with the proposed > > "guess encoding from BOM" mode: it isn't properly part of the stream at > > all in that case. > > This fee

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread M.-A. Lemburg
Guido van Rossum wrote: > On Fri, Jan 8, 2010 at 6:34 AM, Antoine Pitrou wrote: >> Victor Stinner haypocalc.com> writes: >>> >>> I wrote a new version of my patch (version 3): >>> >>> * don't change the default behaviour: use open(filename, encoding="BOM") to >>> check the BOM if there is any >>

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Guido van Rossum python.org> writes: > > > Well, I think if we implement this the default behaviour *should* be > > changed. > > It looks a bit senseless to have two different "auto-choose" options, one with > > encoding=None and one with encoding="BOM". > > Well there *are* two different auto

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver wrote: > The BOM should not be seekeable if the file is opened with the proposed > "guess encoding from BOM" mode:  it isn't properly part of the stream at > all in that case. This feels about right to me. There are still questions though: immediately

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Fri, Jan 8, 2010 at 1:05 AM, "Martin v. Löwis" wrote: >>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good >>> description of the issues: >>> .  Basically, some >>> Windows text APIs will emit a UTF-8 "BOM" in order to ide

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Fri, Jan 8, 2010 at 6:34 AM, Antoine Pitrou wrote: > Victor Stinner haypocalc.com> writes: >> >> I wrote a new version of my patch (version 3): >> >>  * don't change the default behaviour: use open(filename, encoding="BOM") to >> check the BOM if there is any > > Well, I think if we implement

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 11:55 PM, Glyph Lefkowitz wrote: > I'm saying that the BOM itself isn't enough to detect that the file is > actually UTF-8. And I'm saying that it is, with as much certainty as we can ever guess the encoding of a file. > If (for whatever reason: explicitly specified, gues

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Antoine Pitrou
Victor Stinner haypocalc.com> writes: > > I wrote a new version of my patch (version 3): > > * don't change the default behaviour: use open(filename, encoding="BOM") to > check the BOM if there is any Well, I think if we implement this the default behaviour *should* be changed. It looks a bit

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 10:10:23, Martin v. Löwis a écrit : > > Builtin open() function is unable to open an UTF-16/32 file starting with > > a BOM if the encoding is not specified (raise an unicode error). For an > > UTF-8 file starting with a BOM, read()/readline() returns also the BOM > > wh

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 01:52:20, Guido van Rossum a écrit : > And for the other two, perhaps it would make more sense to have > a separate encoding-guessing function that takes a binary stream and > returns a text stream wrapping it with the proper encoding? I chose to modify open()+TextIOW

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 05:21:04, Guido van Rossum a écrit : (...) > (And yes, I know this happens. Doesn't mean we need to auto-guess by > default; there are lots of issues e.g. what should happen after > seeking to offset 0?) I wrote a new version of my patch (version 3): * don't change th

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
> Builtin open() function is unable to open an UTF-16/32 file starting with a > BOM if the encoding is not specified (raise an unicode error). For an UTF-8 > file starting with a BOM, read()/readline() returns also the BOM whereas the > BOM should be "ignored". It depends. If you use the utf-8-

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Victor Stinner
Le vendredi 08 janvier 2010 03:23:08, MRAB a écrit : > Guido van Rossum wrote: > > I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy > > talk. And for the other two, perhaps it would make more sense to have > > a separate encoding-guessing function that takes a binary stream and

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
> But it should do something sane when reading such files. I can't > really see any harm in throwing it away, especially since use of > ZERO-WIDTH NO-BREAK SPACE as a joining character has been deprecated > IIRC. And indeed it does, when you open the file in the utf-8-sig encoding. Regards, Mart
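
For readers who have not used it, the difference between the plain "utf-8" codec and "utf-8-sig" can be sketched as follows. The temporary file and variable names are invented for the illustration; both codec names and codecs.BOM_UTF8 are standard.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file that starts with the 3-byte signature.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "hello".encode("utf-8"))

# Plain utf-8 keeps the signature as a U+FEFF character in the text.
with open(path, encoding="utf-8") as f:
    with_plain_utf8 = f.read()

# utf-8-sig drops the signature if present (and tolerates its absence).
with open(path, encoding="utf-8-sig") as f:
    with_utf8_sig = f.read()

os.remove(path)
```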

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-08 Thread Martin v. Löwis
>> It *is* crazy, but unfortunately rather common. Wikipedia has a good >> description of the issues: >> . Basically, some >> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as >> being UTF-8, so it's become a convention

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Glyph Lefkowitz
On Jan 7, 2010, at 11:21 PM, Guido van Rossum wrote: > On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz > wrote: >> >> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote: >>> >>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy >>> talk. And for the other two, perhaps it wo

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Tres Seaver
Guido van Rossum wrote: > On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz > wrote: >> >> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote: >> >> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner >> wrote: >> >> Hi, >> >> Builtin open() function is un

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Stephen J. Turnbull
Guido van Rossum writes: > I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy > talk. That doesn't stop many applications from doing it. Python should perhaps not produce UTF-8 + BOM without a disclaimer of indemnification against all resulting damage, signed in blood, from t

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz wrote: > > > On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote: > > On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner > wrote: > > Hi, > > Builtin open() function is unable to open an UTF-16/32 file starting with a > > BOM if the encoding is not speci

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Glyph Lefkowitz
On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote: > On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner > wrote: >> Hi, >> >> Builtin open() function is unable to open an UTF-16/32 file starting with a >> BOM if the encoding is not specified (raise an unicode error). For an UTF-8 >> file starting

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread MRAB
Guido van Rossum wrote: I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? Alte

Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Guido van Rossum
I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding? --Guido On Thu, Jan 7, 2010 a
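
A separate function of the kind Guido describes might look roughly like the sketch below. This is not Victor's patch or any stdlib API: the name wrap_guessing_bom, its fallback parameter, and the choice to require a seekable stream are all assumptions made for illustration. Only the codecs.BOM_* constants, io.TextIOWrapper and locale.getpreferredencoding are real.

```python
import codecs
import io
import locale

# Check the 4-byte BOMs first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def wrap_guessing_bom(binary_stream, fallback=None):
    """Hypothetical helper: wrap a seekable binary stream in a text stream.

    The encoding is taken from a leading BOM if there is one; otherwise
    `fallback` (or the locale's preferred encoding) is used.
    """
    start = binary_stream.tell()
    head = binary_stream.read(4)
    binary_stream.seek(start)  # let the chosen codec consume the BOM itself

    encoding = fallback or locale.getpreferredencoding(False)
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(binary_stream, encoding=encoding)


# Usage with in-memory streams standing in for files opened in "rb" mode.
utf16_stream = io.BytesIO(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be"))
utf16_text = wrap_guessing_bom(utf16_stream)

plain_stream = io.BytesIO(b"plain")
plain_text = wrap_guessing_bom(plain_stream, fallback="ascii")
```

Because the BOM-aware codecs ("utf-16", "utf-32", "utf-8-sig") strip the mark themselves, the helper only has to pick a codec name. It leaves the questions about seeking and writing that come up later in the thread unanswered.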

[Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Victor Stinner
Hi, Builtin open() function is unable to open an UTF-16/32 file starting with a BOM if the encoding is not specified (raise an unicode error). For an UTF-8 file starting with a BOM, read()/readline() returns also the BOM whereas the BOM should be "ignored". See recent issues related to reading