> "MvL" == "Martin v. Löwis" <[EMAIL PROTECTED]> writes:
MvL> This would also support your use case, and in a better way.
MvL> The Unicode assertion that UTF-16 is BE by default is void
MvL> these days - there is *always* a higher layer protocol, and
MvL> it more often than not
Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>
>>It would be nice if you could optionally specify that the codec would
>>assume UTF-16BE if no BOM was present, and not raise UnicodeError in
>>that case, which would preserve the current behaviour as well as allow
>>users to ask for behaviour wh
Walter Dörwald wrote:
> Nicholas Bastin wrote:
>
> It should be feasible to implement your own codec for that
> based on Lib/encodings/utf_16.py. Simply replace the line
> in StreamReader.decode():
> raise UnicodeError,"UTF-16 stream does not start with BOM"
> with:
> self.decode = codecs.utf_
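
A minimal sketch of the behaviour being asked for, written as a standalone function rather than as a patch to Lib/encodings/utf_16.py. The name decode_utf16_default_be is hypothetical; only the codecs constants and the "utf-16-be" / "utf-16-le" codec names come from the standard library.

```python
import codecs


def decode_utf16_default_be(data: bytes, errors: str = "strict") -> str:
    """Decode UTF-16 bytes, assuming big-endian when no BOM is present.

    Hypothetical helper, not part of the codecs module.
    """
    # An explicit BOM always wins and is stripped from the result.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be", errors)
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le", errors)
    # No BOM: fall back to big-endian instead of raising UnicodeError.
    return data.decode("utf-16-be", errors)


print(decode_utf16_default_be(b"\x00a\x00b"))
```

The fallback is only as good as the assumption behind it: little-endian data without a BOM decodes to garbage rather than raising.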
Nicholas Bastin wrote:
> It would be nice if you could optionally specify that the codec would
> assume UTF-16BE if no BOM was present, and not raise UnicodeError in
> that case, which would preserve the current behaviour as well as allow
> users' to ask for behaviour which conforms to the standard
Nicholas Bastin wrote:
> On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
>
> [...]
>> If you do have UTF-16 without a BOM mark it's much better
>> to let a short function analyze the text by reading the first
>> few bytes of the file and then make an educated guess based
>> on the findings. You
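
One possible shape for such a "short function", as a rough sketch only. It assumes the text is mostly ASCII/Latin-1, so that the zero high bytes cluster at even offsets for big-endian data and at odd offsets for little-endian data. The name guess_utf16_byteorder is hypothetical, and the heuristic will not work for text dominated by other scripts.

```python
import codecs
from typing import Optional


def guess_utf16_byteorder(sample: bytes) -> Optional[str]:
    """Guess the UTF-16 byte order from the first bytes of a file.

    Hypothetical helper: returns a codec name, or None if undecided.
    """
    # A BOM settles the question immediately.
    if sample.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if sample.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # Assumption: mostly ASCII text, so one byte of each pair is zero.
    even_zeros = sample[0::2].count(b"\x00")
    odd_zeros = sample[1::2].count(b"\x00")
    if even_zeros > odd_zeros:
        return "utf-16-be"
    if odd_zeros > even_zeros:
        return "utf-16-le"
    return None  # no evidence either way


print(guess_utf16_byteorder("hello".encode("utf-16-le")))
```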
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
Ok, but I don't really follow you here: you are suggesting to
relax the current UTF-16 behavior and to start defaulting to
UTF-16-BE if no BOM is present - that's most likely going to
cause more problems than it seems to solve: namely complete
garba
Nicholas Bastin wrote:
>
> On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
>
>>> The current implementation of the utf-16 codecs makes for some
>>> irritating gymnastics to write the BOM into the file before reading it
>>> if it contains no BOM, which seems quite like a bug in the codec.
>>
>>
>
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
The current implementation of the utf-16 codecs makes for some
irritating gymnastics to write the BOM into the file before reading it
if it contains no BOM, which seems quite like a bug in the codec.
The codec writes a BOM in the first call to .write
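
A small illustration of that write-side behaviour, using the stream writer returned by codecs.getwriter("utf-16") in current Python versions. The helper name write_pieces is hypothetical; the point is only that the BOM appears once, at the start of the output.

```python
import codecs
import io


def write_pieces(pieces, encoding="utf-16"):
    """Write several strings through one StreamWriter and return the bytes."""
    buf = io.BytesIO()
    writer = codecs.getwriter(encoding)(buf)
    for piece in pieces:
        writer.write(piece)  # only the first call emits the BOM
    return buf.getvalue()


data = write_pieces(["a", "b"])
# codecs.BOM_UTF16 is the BOM in the platform's native byte order.
print(data.startswith(codecs.BOM_UTF16), data.count(codecs.BOM_UTF16))
```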
Nicholas Bastin wrote:
>
> On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
>
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not re
> "Walter" == Walter Dörwald <[EMAIL PROTECTED]> writes:
Walter> Not really. In every encoding where a sequence of more
Walter> than one byte maps to one Unicode character, you will
Walter> always need some kind of buffering. If we remove the
Walter> handling of initial BOMs fr
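
The buffering point can be seen with the incremental decoders available in current Python versions (codecs.getincrementaldecoder). In this sketch a two-byte UTF-8 sequence is split across two chunks; the helper name decode_chunks is hypothetical.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks one at a time, returning the text produced per chunk."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for i, chunk in enumerate(chunks):
        final = i == len(chunks) - 1
        # Incomplete trailing bytes are kept inside the decoder until
        # the rest of the sequence arrives in a later chunk.
        pieces.append(decoder.decode(chunk, final))
    return pieces


# "Löwis" in UTF-8, with the two bytes of "ö" split across chunks.
print(decode_chunks([b"L\xc3", b"\xb6wis"]))
```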
On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.
I've actually been confused
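
For comparison, current Python versions ship a "utf-8-sig" codec that behaves in the relaxed way suggested here: the signature is stripped when present and not required when absent. A short sketch, with read_utf8_text as a hypothetical helper name:

```python
import codecs


def read_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading signature if there is one."""
    return data.decode("utf-8-sig")


with_sig = codecs.BOM_UTF8 + b"abc"
without_sig = b"abc"

print(read_utf8_text(with_sig) == read_utf8_text(without_sig))
# The plain "utf-8" codec keeps the signature as U+FEFF instead.
print(repr(with_sig.decode("utf-8")))
```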
Stephen J. Turnbull wrote:
Because the signature/BOM is not a chunk, it's a header. Handling the
signature/BOM is part of stream initialization, not translation, to my
mind.
I'm sorry, but I'm losing track as to what precisely you are trying to
say. You seem to be using a mental model that is enti
Stephen J. Turnbull wrote:
"Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
Martin> I can't put these two paragraphs together. If you think
Martin> that explicit is better than implicit, why do you not want
Martin> to make different calls for the first chunk of a stream,
Marti
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
Martin> I can't put these two paragraphs together. If you think
Martin> that explicit is better than implicit, why do you not want
Martin> to make different calls for the first chunk of a stream,
Martin> and the subsequen
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>> There are situations where the byte stream might be temporarily
>> exhausted, e.g. an XML parser that tries to support the
>> IncrementalParser interface, or when you want to decode
>> encoded data piecewise, because you want to give a progress
>> r
Stephen J. Turnbull wrote:
Of course it must be supported. My point is that many strings (in my
applications, all but those strings that result from slurping in a
file or process output in one go -- example, not a statistically valid
sample!) are not the beginning of "what once was a stream". It
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
Martin> So people do use the "decode-it-all" mode, where no
Martin> sequential access is necessary - yet the beginning of the
Martin> string is still the beginning of what once was a
Martin> stream. This case must be supp
Walter Dörwald wrote:
There are situations where the byte stream might be temporarily
exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode
encoded data piecewise, because you want to give a progress
report.
Yes, but these are not file-like
Evan Jones wrote:
> On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the us
Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
never see these two character
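
One way to picture the problem and a possible way around it, as a sketch only. The class SigSkippingDecoder is hypothetical and is not the implementation under discussion: it withholds input only while the bytes seen so far could still be the start of a BOM, and a final flag forces the decision when the stream ends.

```python
import codecs


class SigSkippingDecoder:
    """Hypothetical stateful UTF-8 decoder that skips a leading signature."""

    def __init__(self):
        self._pending = b""      # bytes held back while undecided
        self._decided = False    # True once the BOM question is settled
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def decode(self, data: bytes, final: bool = False) -> str:
        if not self._decided:
            self._pending += data
            bom = codecs.BOM_UTF8
            undecided = (len(self._pending) < len(bom)
                         and bom.startswith(self._pending))
            if undecided and not final:
                return ""  # could still turn out to be a BOM; wait
            self._decided = True
            data, self._pending = self._pending, b""
            if data.startswith(bom):
                data = data[len(bom):]
        return self._decoder.decode(data, final)


d = SigSkippingDecoder()
print(repr(d.decode(b"\xef\xbb")), repr(d.decode(b"\xbfab")))
print(repr(SigSkippingDecoder().decode(b"ab")))
```

With this prefix check a file containing only "ab" is returned at once, because "ab" cannot be the beginning of the three-byte signature.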
On Tuesday 05 April 2005 15:53, Evan Jones wrote:
> This functionality is provided by a flush() method on similar objects,
> such as the zlib compression objects.
Or by close() on other objects (htmllib, HTMLParser, the SAX incremental
parser, etc.).
Too bad there's more than one way to do it.
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
nev
Walter Dörwald wrote:
> M.-A. Lemburg wrote:
>
>>> [...]
>>> With the UTF-8-SIG codec, it would apply to all operation
>>> modes of the codec, whether stream-based or from strings. Whether
>>> or not to use the codec would be the application's choice.
>>
>> I'd suggest to use the same mode of operat
Stephen J. Turnbull wrote:
Martin> With the UTF-8-SIG codec, it would apply to all operation
Martin> modes of the codec, whether stream-based or from strings.
I had in mind the ability to treat a string as a stream.
Hmm. A string is not a stream, but it could be the contents of a stream.
A
>>"MAL" == M <[EMAIL PROTECTED]> writes:
MAL> Stephen J. Turnbull wrote:
>> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it
>> even adds them to existing UTF-8 files lacking them.
MAL> Is that a MS application ? AFAIK, notepad, wordpad and MS
MAL> Office alwa
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
Martin> Stephen J. Turnbull wrote:
>> However, this option should be part of the initialization of an
>> IO stream which produces Unicodes, _not_ an operation on
>> arbitrary internal strings (whether raw or Unicode).
Stephen J. Turnbull wrote:
>>"MAL" == M <[EMAIL PROTECTED]> writes:
>
>
> MAL> The BOM (byte order mark) was a non-standard Microsoft
> MAL> invention to detect Unicode text data as such (MS always uses
> MAL> UTF-16-LE for Unicode text files).
>
> The Japanese "memopado" (Notep
M.-A. Lemburg wrote:
[...]
With the UTF-8-SIG codec, it would apply to all operation modes of
the codec, whether stream-based or from strings. Whether or not to
use the codec would be the application's choice.
I'd suggest to use the same mode of operation as we have in
the UTF-16 codec: it removes
Martin v. Löwis wrote:
> Stephen J. Turnbull wrote:
>
>> So there is a standard for the UTF-8 signature, and I know of
>> applications which produce it. While I agree with you that Python's
>> codecs shouldn't produce it (by default), providing an option to strip
>> is a good idea.
>
> I would p
Stephen J. Turnbull wrote:
So there is a standard for the UTF-8 signature, and I know of
applications which produce it. While I agree with you that Python's
codecs shouldn't produce it (by default), providing an option to strip
is a good idea.
I would personally like to see an "utf-8-bom" codec (p
> "MAL" == M <[EMAIL PROTECTED]> writes:
MAL> The BOM (byte order mark) was a non-standard Microsoft
MAL> invention to detect Unicode text data as such (MS always uses
MAL> UTF-16-LE for Unicode text files).
The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
th
On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:
The BOM (byte order mark) was a non-standard Microsoft invention
to detect Unicode text data as such (MS always uses UTF-16-LE for
Unicode text files).
Well, its origins do not really matter since at this point the BOM is
firmly encoded in the Unicod
Evan Jones wrote:
> I recently rediscovered this strange behaviour in Python's Unicode
> handling. I *think* it is a bug, but before I go and try to hack
> together a patch, I figure I should run it by the experts here on
> Python-Dev. If you understand Unicode, please let me know if there are
> pr
I recently rediscovered this strange behaviour in Python's Unicode
handling. I *think* it is a bug, but before I go and try to hack
together a patch, I figure I should run it by the experts here on
Python-Dev. If you understand Unicode, please let me know if there are
problems with making these