On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:
>> IMO, encoding estimation is something that many web programs will have
>> to deal with
> Can you please explain why that is? Web programs should not normally
> have the need to detect the encoding; instead, it should be specified
> always - unless you are talking about browsers specifically, which
> need to support web pages that specify the encoding incorrectly.
Two cases come immediately to mind: email and web forms.
When a web browser POSTs data, there is no standard way of communicating which encoding it's using. There are some hints which make it easier (accept-charset attributes, the encoding used to send the page to the browser), but no guarantees. Email is a smaller problem, because it usually has a helpful Content-Type header, but that header can't always be trusted either.
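
To make the POST case concrete, here is a rough sketch of the kind of fallback a web application ends up hand-rolling when no encoding is declared. The function name `guess_decode` and the candidate list are hypothetical, chosen only for illustration; this is not an existing or proposed stdlib API, and a real estimator would need to be far smarter than "first codec that doesn't fail".

```python
def guess_decode(data, candidates=("utf-8", "cp1252")):
    """Decode bytes with the first candidate codec that succeeds.

    Hypothetical helper: returns (text, encoding_name).
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            # This codec rejected the bytes; try the next candidate.
            continue
    # latin-1 maps every byte to a code point, so it never raises.
    # The result may still be the wrong text; it is only a last resort.
    return data.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    # Bytes as a form handler might receive them, with no charset declared.
    text, enc = guess_decode("caf\u00e9".encode("cp1252"))
    print(enc, text)
```

Trying UTF-8 first is a deliberate choice: text in a legacy single-byte encoding rarely happens to be valid UTF-8, so a clean UTF-8 decode is a fairly strong signal, whereas a clean cp1252 decode proves very little.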

Now, at the moment, the only data I have to support this claim is my experience with DrProject in non-English locations. If I'm the only one who has had these sorts of problems, I'll go back to "Unicode for Dummies".

>> so it might as well be built in; I would prefer the option
>> to run `text=input.encode('guess')` (or something similar) than relying
>> on an external dependency or worse yet using a hand-rolled algorithm.
> Ok, let me try differently then. Please feel free to post a patch to
> bugs.python.org, and let other people rip it apart.
> For example, I don't think it should be a codec, as I can't imagine it
> working on streams.

As is frequently the case, this seems to be a much larger problem than I originally believed.

I'll go back and take another look at the problem, then come back if new revelations appear.