Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Stephen J. Turnbull
> "MvL" == "Martin v. Löwis" <[EMAIL PROTECTED]> writes: MvL> This would also support your use case, and in a better way. MvL> The Unicode assertion that UTF-16 is BE by default is void MvL> these days - there is *always* a higher layer protocol, and MvL> it more often than not

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread M.-A. Lemburg
Martin v. Löwis wrote: > Nicholas Bastin wrote: > >>It would be nice if you could optionally specify that the codec would >>assume UTF-16BE if no BOM was present, and not raise UnicodeError in >>that case, which would preserve the current behaviour as well as allow >>users to ask for behaviour wh

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Walter Dörwald
Walter Dörwald wrote: > Nicholas Bastin wrote: > > It should be feasible to implement your own codec for that > based on Lib/encodings/utf_16.py. Simply replace the line > in StreamReader.decode(): > raise UnicodeError,"UTF-16 stream does not start with BOM" > with: > self.decode = codecs.utf_
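
For readers skimming the index, here is a minimal sketch of the behaviour being asked for: honour a BOM when one is present, and otherwise assume big-endian instead of raising. It is a stateless helper in modern Python, not the StreamReader change described in the message, and the name decode_utf16_default_be is made up for illustration.

```python
import codecs


def decode_utf16_default_be(data: bytes, errors: str = "strict") -> str:
    """Decode UTF-16 bytes; fall back to big-endian when no BOM is present.

    Hypothetical helper, not part of the standard library.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # BOM says big-endian: drop the two BOM bytes and decode the rest.
        return data[2:].decode("utf-16-be", errors)
    if data.startswith(codecs.BOM_UTF16_LE):
        # BOM says little-endian.
        return data[2:].decode("utf-16-le", errors)
    # No BOM at all: assume big-endian rather than raising an error.
    return data.decode("utf-16-be", errors)
```

Whether big-endian is a sensible default is exactly what the thread disputes, so callers would probably want the fallback to be a parameter rather than hard-coded.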

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Martin v. Löwis
Nicholas Bastin wrote: > It would be nice if you could optionally specify that the codec would > assume UTF-16BE if no BOM was present, and not raise UnicodeError in > that case, which would preserve the current behaviour as well as allow > users to ask for behaviour which conforms to the standard

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Walter Dörwald
Nicholas Bastin wrote: > On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote: > > [...] >> If you do have UTF-16 without a BOM mark it's much better >> to let a short function analyze the text by reading the first >> few bytes of the file and then make an educated guess based >> on the findings. You
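
The "short function" idea quoted above could look roughly like the sketch below. The heuristic is an assumption of mine, not something proposed in the thread: in mostly-ASCII text, UTF-16-BE puts zero bytes at even offsets and UTF-16-LE puts them at odd offsets. The name guess_utf16_codec is hypothetical.

```python
import codecs


def guess_utf16_codec(sample: bytes) -> str:
    """Guess a codec name for UTF-16 data from its first few bytes.

    Hypothetical helper; the zero-byte heuristic only works well for
    text that is mostly in the ASCII/Latin-1 range.
    """
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # A BOM is present, so the plain "utf-16" codec can sort it out.
        return "utf-16"
    even_zeros = sample[0::2].count(0)  # high bytes if big-endian
    odd_zeros = sample[1::2].count(0)   # high bytes if little-endian
    # On a tie (including empty input) fall back to big-endian.
    return "utf-16-le" if odd_zeros > even_zeros else "utf-16-be"
```

A caller would then read a small prefix of the file, pass it to the function, and open the file with the returned codec name. For text outside the Latin range the guess can easily be wrong, so it is no substitute for a higher-level protocol that states the encoding.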

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Nicholas Bastin
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote: Ok, but I don't really follow you here: you are suggesting to relax the current UTF-16 behavior and to start defaulting to UTF-16-BE if no BOM is present - that's most likely going to cause more problems than it seems to solve: namely complete garba

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread M.-A. Lemburg
Nicholas Bastin wrote: > > On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote: > >>> The current implementation of the utf-16 codecs makes for some >>> irritating gymnastics to write the BOM into the file before reading it >>> if it contains no BOM, which seems quite like a bug in the codec. >> >> >

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread Nicholas Bastin
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote: The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec. The codec writes a BOM in the first call to .write

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-07 Thread M.-A. Lemburg
Nicholas Bastin wrote: > > On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote: > >> Note that the UTF-16 codec is strict w/r to the presence >> of the BOM mark: you get a UnicodeError if a stream does >> not start with a BOM mark. For the UTF-8-SIG codec, this >> should probably be relaxed to not re

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Stephen J. Turnbull
> "Walter" == Walter Dörwald <[EMAIL PROTECTED]> writes: Walter> Not really. In every encoding where a sequence of more Walter> than one byte maps to one Unicode character, you will Walter> always need some kind of buffering. If we remove the Walter> handling of initial BOMs fr

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Nicholas Bastin
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote: Note that the UTF-16 codec is strict w/r to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM. I've actually been confused

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Martin v. Löwis
Stephen J. Turnbull wrote: Because the signature/BOM is not a chunk, it's a header. Handling the signature/BOM is part of stream initialization, not translation, to my mind. I'm sorry, but I'm losing track as to what precisely you are trying to say. You seem to be using a mental model that is enti

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Walter Dörwald
Stephen J. Turnbull wrote: "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes: Martin> I can't put these two paragraphs together. If you think Martin> that explicit is better than implicit, why do you not want Martin> to make different calls for the first chunk of a stream, Marti

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes: Martin> I can't put these two paragraphs together. If you think Martin> that explicit is better than implicit, why do you not want Martin> to make different calls for the first chunk of a stream, Martin> and the subsequen

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-06 Thread Walter Dörwald
Martin v. Löwis wrote: > Walter Dörwald wrote: >> There are situations where the byte stream might be temporarily >> exhausted, e.g. an XML parser that tries to support the >> IncrementalParser interface, or when you want to decode >> encoded data piecewise, because you want to give a progress >> r

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote: Of course it must be supported. My point is that many strings (in my applications, all but those strings that result from slurping in a file or process output in one go -- example, not a statistically valid sample!) are not the beginning of "what once was a stream". It

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes: Martin> So people do use the "decode-it-all" mode, where no Martin> sequential access is necessary - yet the beginning of the Martin> string is still the beginning of what once was a Martin> stream. This case must be supp

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Walter Dörwald wrote: There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode encoded data piecewise, because you want to give a progress report. Yes, but these are not file-like

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Evan Jones wrote: > On Apr 5, 2005, at 15:33, Walter Dörwald wrote: >> The stateful decoder has a little problem: At least three bytes >> have to be available from the stream until the StreamReader >> decides whether these bytes are a BOM that has to be skipped. >> This means that if the file only

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Martin v. Löwis wrote: > Walter Dörwald wrote: >> The stateful decoder has a little problem: At least three bytes >> have to be available from the stream until the StreamReader >> decides whether these bytes are a BOM that has to be skipped. >> This means that if the file only contains "ab", the us

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Walter Dörwald wrote: The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two character

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Fred Drake
On Tuesday 05 April 2005 15:53, Evan Jones wrote: > This functionality is provided by a flush() method on similar objects, > such as the zlib compression objects. Or by close() on other objects (htmllib, HTMLParser, the SAX incremental parser, etc.). Too bad there's more than one way to do it.

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Evan Jones
On Apr 5, 2005, at 15:33, Walter Dörwald wrote: The stateful decoder has a little problem: At least three bytes have to be available from the stream until the StreamReader decides whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will nev
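
To make the "ab" problem concrete, here is a deliberately naive sketch of a stateful UTF-8 decoder that holds back input until it has three bytes, plus a flush() method in the spirit of the zlib comparison made elsewhere in the thread. The class name NaiveBOMSkippingDecoder and its feed()/flush() methods are invented for illustration and are not the API under discussion.

```python
import codecs


class NaiveBOMSkippingDecoder:
    """Hypothetical UTF-8 decoder that skips a leading BOM.

    It waits for three bytes before deciding, so short input stays
    buffered until flush() is called.
    """

    def __init__(self) -> None:
        self._pending = b""
        self._decided = False
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def feed(self, chunk: bytes) -> str:
        if self._decided:
            return self._decoder.decode(chunk)
        self._pending += chunk
        if len(self._pending) < 3:
            # Not enough bytes yet to tell whether a BOM is present.
            return ""
        return self._release()

    def _release(self) -> str:
        # Decide once: strip the BOM if present, then decode what was held.
        data, self._pending = self._pending, b""
        self._decided = True
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]
        return self._decoder.decode(data)

    def flush(self) -> str:
        # End of input: hand over anything still buffered.
        text = "" if self._decided else self._release()
        return text + self._decoder.decode(b"", final=True)
```

With this class, feed(b"ab") returns an empty string and the two characters only appear once flush() is called, which is the situation described in the quoted message. A less naive decoder could release the bytes as soon as they can no longer be the start of a BOM.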

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Walter Dörwald wrote: > M.-A. Lemburg wrote: > >>> [...] >>>With the UTF-8-SIG codec, it would apply to all operation >>> modes of the codec, whether stream-based or from strings. Whether >>>or not to use the codec would be the application's choice. >> >> I'd suggest to use the same mode of operat

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote: Martin> With the UTF-8-SIG codec, it would apply to all operation Martin> modes of the codec, whether stream-based or from strings. I had in mind the ability to treat a string as a stream. Hmm. A string is not a stream, but it could be the contents of a stream. A

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
>>"MAL" == M <[EMAIL PROTECTED]> writes: MAL> Stephen J. Turnbull wrote: >> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it >> even adds them to existing UTF-8 files lacking them. MAL> Is that a MS application ? AFAIK, notepad, wordpad and MS MAL> Office alwa

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes: Martin> Stephen J. Turnbull wrote: >> However, this option should be part of the initialization of an >> IO stream which produces Unicodes, _not_ an operation on >> arbitrary internal strings (whether raw or Unicode).

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Stephen J. Turnbull wrote: >>"MAL" == M <[EMAIL PROTECTED]> writes: > > > MAL> The BOM (byte order mark) was a non-standard Microsoft > MAL> invention to detect Unicode text data as such (MS always uses > MAL> UTF-16-LE for Unicode text files). > > The Japanese "memopado" (Notep

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
M.-A. Lemburg wrote: [...] With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application's choice. I'd suggest to use the same mode of operation as we have in the UTF-16 codec: it removes

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Martin v. Löwis wrote: > Stephen J. Turnbull wrote: > >> So there is a standard for the UTF-8 signature, and I know of >> applications which produce it. While I agree with you that Python's >> codecs shouldn't produce it (by default), providing an option to strip >> is a good idea. > > I would p

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote: So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python's codecs shouldn't produce it (by default), providing an option to strip is a good idea. I would personally like to see an "utf-8-bom" codec (p
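
As a point of reference, later Python releases (2.5 and up) ship a codec named "utf-8-sig" that behaves along these lines. The short example below uses current Python 3 and only shows the difference between the plain and the signature-aware codec; it does not claim to match every detail of the proposal in this message.

```python
import codecs

# UTF-8 data with a signature (BOM) in front, as some editors write it.
data = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# The plain codec keeps the signature as a leading U+FEFF character.
plain = data.decode("utf-8")

# The signature-aware codec strips it, and tolerates its absence.
stripped = data.decode("utf-8-sig")
no_sig = "héllo".encode("utf-8").decode("utf-8-sig")

# When encoding, "utf-8-sig" writes the signature; "utf-8" never does.
with_sig = "a".encode("utf-8-sig")
```

Here plain starts with "\ufeff", while stripped and no_sig are both "héllo", and with_sig begins with the three signature bytes.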

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-04 Thread Stephen J. Turnbull
> "MAL" == M <[EMAIL PROTECTED]> writes: MAL> The BOM (byte order mark) was a non-standard Microsoft MAL> invention to detect Unicode text data as such (MS always uses MAL> UTF-16-LE for Unicode text files). The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds th

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread Evan Jones
On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote: The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files). Well, its origins do not really matter since at this point the BOM is firmly encoded in the Unicod

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread M.-A. Lemburg
Evan Jones wrote: > I recently rediscovered this strange behaviour in Python's Unicode > handling. I *think* it is a bug, but before I go and try to hack > together a patch, I figure I should run it by the experts here on > Python-Dev. If you understand Unicode, please let me know if there are > pr

[Python-Dev] Unicode byte order mark decoding

2005-04-01 Thread Evan Jones
I recently rediscovered this strange behaviour in Python's Unicode handling. I *think* it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are problems with making these