On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:
IMO, encoding estimation is something that many web programs will
have to deal with
Can you please explain why that is? Web programs should not normally
have the need to detect the encoding; instead, it should be specified
always - unless you are talking about browsers specifically, which
need to support web pages that specify the encoding incorrectly.
Two cases come immediately to mind: email and web forms.
When a web browser POSTs data, there is no standard way of
communicating which encoding it's using. There are some hints which
make it easier (accept-charset attributes, the encoding used to send
the page to the browser), but no guarantees.
Email is a smaller problem, because it usually has a helpful
Content-Type header, but that's no guarantee.
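To make the "hints but no guarantees" situation concrete, here is a minimal sketch of the kind of hand-rolled fallback a web program ends up writing for POSTed bytes. It is written for Python 3; the function name `guess_decode` and the candidate list are hypothetical and only illustrate the idea, they are not an existing or proposed standard-library API.

```python
def guess_decode(data, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly.

    `candidates` is an assumed priority list, e.g. the charset the page was
    served in followed by a common legacy encoding.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot
    # fail, but the result may be mojibake.
    return data.decode("latin-1"), "latin-1"
```

Strict UTF-8 goes first because it rejects most byte strings produced by legacy 8-bit encodings, whereas single-byte codecs accept almost anything and so say little about whether the guess was right.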
Now, at the moment, the only data I have to support this claim is my
experience with DrProject in non-English locations.
If I'm the only one who has had these sorts of problems, I'll go back
to "Unicode for Dummies".
so it might as well be built in; I would prefer the option
to run `text=input.encode('guess')` (or something similar) than
relying on an external dependency or worse yet using a hand-rolled
algorithm.
Ok, let me try differently then. Please feel free to post a patch to
bugs.python.org, and let other people rip it apart.
For example, I don't think it should be a codec, as I can't imagine it
working on streams.
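One way to read the stream objection, sketched in Python 3 under the assumption that a guesser only sees the data delivered so far: a guess made from the first chunk can be invalidated by bytes that arrive later. The helper `guess_from_prefix` and the sample chunks are hypothetical.

```python
def guess_from_prefix(chunk):
    """Naive guess based only on the bytes seen so far."""
    try:
        chunk.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        return "utf-8"

# Two chunks as they might arrive from a socket or file.
chunks = [b"Subject: hello ", "café".encode("utf-8")]

# The first chunk is pure ASCII, so the guesser commits to "ascii".
first_guess = guess_from_prefix(chunks[0])

# The second chunk contains non-ASCII bytes, so the committed guess breaks.
try:
    b"".join(chunks).decode(first_guess)
    committed_ok = True
except UnicodeDecodeError:
    committed_ok = False
```

A whole-buffer function can inspect all the bytes before answering, which a codec decoding a stream incrementally cannot do without buffering everything first.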
As things frequently are, it seems like this is a much larger problem
than I originally believed.
I'll go back and take another look at the problem, then come back if
new revelations appear.