Re: Detection of unlabeled UTF-8

Aryeh Gregor Fri, 30 Aug 2013 06:33:11 -0700

On Fri, Aug 30, 2013 at 1:03 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
> This is true if you run the heuristic over the entire byte stream.
> Unfortunately, since we support incremental loading of HTML (and will
> have to continue to do so), we don't have the entire byte stream
> available at the time when we need to make a decision of what encoding
> to assume.


In particular, you need to decide on the encoding before you start
running any user script, because you don't want document.characterSet
etc. to change once it might have already been accessed.  For
performance reasons, we want to be able to run any scripts that have
already arrived immediately after receiving the initial TCP response.
This implies we need to decide on the character set after reading the
first segment, which on pages like http://www.eyrie-productions.com/
typically will not contain the actual page content that we would want
to sniff.  Right?

(I say this only because my initial reaction was that we could hold
off on deciding what encoding to use until we find the first non-ASCII
byte without any ill effects, if we really wanted to.  That would
probably make the site in question work.  But then I realized it would
break document.characterSet, so it's not an option even if we wanted
more sniffing.)
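
To make the trade-off concrete, here is a rough Python sketch of the
"hold off until the first non-ASCII byte" idea.  Everything in it is
hypothetical: the DeferredSniffer class, its feed() and
force_decision() methods, and the windows-1252 fallback are my own
names and assumptions for illustration, not how Gecko's parser is
actually structured.  The heuristic itself is deliberately naive: one
valid multi-byte UTF-8 sequence is treated as enough evidence.

```python
import codecs
from typing import Optional


class DeferredSniffer:
    """Hypothetical sketch: postpone the encoding decision until the
    first non-ASCII byte arrives, or until a script forces the issue."""

    def __init__(self, fallback: str = "windows-1252") -> None:
        # The fallback encoding is an assumption made for this example.
        self.fallback = fallback
        self.decided: Optional[str] = None
        # A strict incremental decoder buffers a multi-byte sequence that
        # is split across network chunks and raises on invalid UTF-8.
        self._utf8 = codecs.getincrementaldecoder("utf-8")("strict")

    def feed(self, chunk: bytes) -> Optional[str]:
        """Consume one chunk; return the encoding once it is decided."""
        if self.decided is None:
            try:
                text = self._utf8.decode(chunk)
            except UnicodeDecodeError:
                # Invalid as UTF-8: lock in the fallback.
                self.decided = self.fallback
            else:
                # A complete, valid non-ASCII sequence was seen.
                if any(ord(ch) > 0x7F for ch in text):
                    self.decided = "utf-8"
        return self.decided

    def force_decision(self) -> str:
        """Call before running any script: from here on the encoding is
        observable (document.characterSet), so it must not change."""
        if self.decided is None:
            self.decided = self.fallback
        return self.decided
```

If force_decision() has to be called after an all-ASCII first segment
that already contains a script, the sniffer is stuck with the fallback
no matter what bytes arrive later, which is exactly the problem
described above.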
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
