Re: Detection of unlabeled UTF-8

Neil Harris Fri, 06 Sep 2013 09:38:21 -0700

On 06/09/13 16:45, Robert Kaiser wrote:

Henri Sivonen wrote:
Considering what Aryeh said earlier in this thread, do you have a
suggestion how to do that so that
> [...]
Hmm, do we have to treat the whole document as a consistent charset? Could we instead, if we don't know the charset, look at every rendered-as-text node/attribute in the DOM tree and run some kind of charset detection on it?
Maybe a dumb idea, but it might avoid the problem at the parsing level.

Robert Kaiser

I think that would create a whole lot more problems than it would fix, and would be unworkable in practice.

Charset detection from content is a probabilistic matter at best, and treating the document as many small snippets of text would not only increase the probability of the detection algorithm getting it wrong for each node, but also give a large number of opportunities per page for at least one of those detections to go wrong.
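
To put rough numbers on the second point, here is a minimal sketch. It assumes, purely for illustration, that every node is detected independently and with the same error rate; the 1% figure and the name p_any_wrong are hypothetical, not measurements of any real detector.

```python
def p_any_wrong(p_node: float, n_nodes: int) -> float:
    """Chance that at least one of n_nodes independent detections is wrong,
    given a per-node error probability p_node (illustrative assumption)."""
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    # Hypothetical 1% per-node error rate across pages of growing size.
    for n in (1, 10, 100, 1000):
        print(f"{n:5d} nodes -> P(at least one wrong) = {p_any_wrong(0.01, n):.3f}")
```

Under those assumptions, even a small per-node error rate makes at least one misdetection likely once a page has on the order of a hundred text nodes. Shorter snippets would, if anything, push the per-node rate higher, since the detector has less evidence to work with.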


-- N.


_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
