On 9/9/13 02:31, Henri Sivonen wrote:
> We don't have telemetry for the question "How often are pages that are not
> labeled as UTF-8, UTF-16 or anything that maps to their replacement
> encoding according to the Encoding Standard and that contain non-ASCII
> bytes in fact valid UTF-8?" How rare would the mislabeled UTF-8 case need
> to be for you to consider the UI that you're proposing not worth it?

I'd think it would depend somewhat on the severity of the misencoding. For example, interpreting a page of UTF-8 as Windows-1252 generally isn't going to completely ruin a page that contains only the occasional accented Latin character, although it will certainly be an obvious defect. I'd be happy to leave the situation be if this happened to fewer than 1% of users over a six-week period.

On the other hand, misrendering a page of UTF-8 that consists predominantly of a non-Latin script is pretty catastrophic, and it's going to tend to happen to the same subset of users over and over again. For that situation, I think I'd want to see fewer than 0.1% of users running builds localized for languages with non-Latin scripts impacted over a six-week period before I was happy leaving things as-is.
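To make the difference in severity concrete, here's a quick illustration (Python for brevity; the sample strings are my own) of what UTF-8 bytes look like when decoded as Windows-1252:

    # Decode UTF-8 bytes as Windows-1252 to compare mojibake severity.
    latin_text = "café résumé"       # mostly ASCII, occasional accents
    greek_text = "καλημέρα κόσμε"    # entirely non-Latin

    for label, text in (("Latin", latin_text), ("Greek", greek_text)):
        mojibake = text.encode("utf-8").decode("windows-1252", errors="replace")
        print(f"{label}: {mojibake}")

    # Latin: cafÃ© rÃ©sumÃ©   -- defaced, but still legible
    # Greek: ÎºÎ±Î»Î·Î¼...     -- not one character survives

The Latin page is ugly; the Greek page is unusable.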

> However, we do have telemetry for the percentage of Firefox sessions in
> which the current character encoding override UI has been used at least
> once. See https://bugzilla.mozilla.org/show_bug.cgi?id=906032 for the
> results broken down by desktop versus Android and then by locale.

I don't think measuring the behavior of the few people who already know about this feature is particularly relevant. The status quo works for them, by definition. I'm far more concerned about the users who get garbled pages and don't have the knowledge to do anything about it.

> I would accept a (performance-conscious) patch for gathering telemetry for
> the UTF-8 question in the HTML parser. However, I'm not volunteering to
> write one myself immediately, because I have bugs on my todo list that have
> been caused by previous attempts of Gecko developers to be well-intentioned
> about DWIM and UI around character encodings. Gotta fix those first.

Great. I'll see if I can wedge in some time to put one together (although I'm similarly swamped, so I don't have a good timeframe for this). If anyone else has time to roll one out, that would be even better.
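For whoever ends up writing it, here's roughly the shape of the measurement I have in mind, as a minimal Python sketch (the function and histogram names are made up for illustration; the real patch would live in Gecko's C++ parser and validate incrementally during parsing rather than re-decoding the whole document):

    from collections import Counter

    histogram = Counter()  # stand-in for a real telemetry histogram

    def probe_unlabeled_page(body: bytes, labeled_encoding: str | None) -> None:
        """Record whether a page not labeled as UTF-8/UTF-16 (or as anything
        mapping to the replacement encoding) that contains non-ASCII bytes
        would in fact have decoded cleanly as UTF-8."""
        if labeled_encoding in {"utf-8", "utf-16", "utf-16le", "utf-16be"}:
            return                  # labeled pages aren't the case in question
        if body.isascii():
            return                  # all-ASCII pages decode identically anyway
        try:
            body.decode("utf-8")
            histogram["unlabeled_non_ascii_valid_utf8"] += 1
        except UnicodeDecodeError:
            histogram["unlabeled_non_ascii_not_utf8"] += 1

That would give us both the numerator and the denominator for the question Henri posed above.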

> Even non-automatic correction means authors can take the attitude that
> getting the encoding wrong is no big deal since the fix is a click away for
> the user.

I'll repeat that it's not our job to police the web. I'm firmly of the opinion that those developers who don't care about doing things right won't do them right no matter how big a stick you personally choose to beat them with. On the other hand, I'm quite worried about collateral damage to our users in your crusade to control publishers.

Give the publishers the tools to understand their errors, and the users the tools to use the web the way they want to use it. Those publishers who aren't bad actors will correct their own behavior -- those who _are_ bad actors aren't going to behave anyway. There's no point getting authoritarian about it and making the web a less accessible place as a consequence.

--
Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
