On Thu, Aug 29, 2013 at 9:41 PM, Zack Weinberg <za...@panix.com> wrote:
> All the discussion of fallback character encodings has reminded me of an
> issue I've been meaning to bring up for some time: As a user of the en-US
> localization, nowadays the overwhelmingly most common situation where I see
> mojibake is when a site puts UTF-8 in its pages without declaring any
> encoding at all (neither via <meta charset> nor Content-Type).

Telemetry data suggests that these days the more common reason for seeing
mojibake is an encoding declaration that is present but wrong. My guess is
that this arises from Linux distributions silently changing their Apache
defaults to send a charset parameter in Content-Type, on the theory that
sending one is good for security even though the person packaging Apache
logically can have no clue what the value of the parameter should be for a
specific deployment. (I think we should not start second-guessing encoding
declarations.)

> It is possible to distinguish UTF-8 from most legacy encodings
> heuristically with high reliability, and I'd like to suggest that we ought
> to do so, independent of locale.

This is true if you run the heuristic over the entire byte stream.
Unfortunately, since we support incremental loading of HTML (and will have
to continue to do so), we don't have the entire byte stream available at the
time when we need to decide which encoding to assume.
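For concreteness, the kind of check being discussed looks roughly like the
sketch below. It uses made-up names and is not Gecko's actual detector; it is
also deliberately loose about overlong forms and surrogates, because the only
goal here is telling UTF-8 apart from legacy single-byte encodings:

  #include <cstddef>
  #include <cstdint>

  // Sketch only (hypothetical names, not Gecko code): classify the bytes
  // seen so far as "could still be UTF-8" or "definitely not UTF-8".
  enum class Utf8Status {
    ValidSoFar,  // every complete sequence so far is plausible UTF-8
    NotUtf8      // contains bytes that cannot occur in UTF-8
  };

  Utf8Status ClassifyPrefix(const uint8_t* data, size_t length) {
    size_t i = 0;
    while (i < length) {
      uint8_t lead = data[i];
      size_t trail = 0;  // continuation bytes this lead byte requires
      if (lead <= 0x7F) { i++; continue; }               // ASCII
      else if (lead >= 0xC2 && lead <= 0xDF) trail = 1;  // 2-byte sequence
      else if (lead >= 0xE0 && lead <= 0xEF) trail = 2;  // 3-byte sequence
      else if (lead >= 0xF0 && lead <= 0xF4) trail = 3;  // 4-byte sequence
      else return Utf8Status::NotUtf8;                   // 0x80..0xC1, 0xF5..0xFF
      for (size_t j = 1; j <= trail; j++) {
        if (i + j >= length) return Utf8Status::ValidSoFar;  // cut off at end of buffer
        if (data[i + j] < 0x80 || data[i + j] > 0xBF) return Utf8Status::NotUtf8;
      }
      i += trail + 1;
    }
    return Utf8Status::ValidSoFar;
  }

The catch for incremental loading is the ValidSoFar case: a prefix that
happens to be all ASCII is reported as plausible UTF-8 yet says nothing about
the bytes that arrive later, so a decision made early in the load can still
turn out to be wrong.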
> Having read through a bunch of the "fallback encoding is wrong" bugs
> Henri's been filing, I have the impression that Henri would prefer we *not*
> detect UTF-8

Correct. Each time a localization sets the fallback to UTF-8, or a heuristic
detector detects unlabeled UTF-8, Web authors get an opportunity to generate
a new legacy of unlabeled UTF-8 content thinking that everything is okay.

> 1. There exist sites that still regularly add new, UTF-8-encoded content,
> but whose *structure* was laid down in the late 1990s or early 2000s,
> declares no encoding, and is unlikely ever to be updated again. The example
> I have to hand is
> http://www.eyrie-productions.com/Forum/dcboard.cgi?az=read_count&om=138&forum=DCForumID24&viewmode=threaded
> ; many other posts on this forum have the same problem. Take note of the
> vintage HTML. I suggested to the admins of this site that they add <meta
> charset="utf-8"> to the master page template, and was told that no one
> involved in current day-to-day operations has the necessary access
> privileges. I suspect that this kind of situation is rather more common
> than we would like to believe.

It's easy to have an anecdotal single data point of something on the Web
being broken. Is there any data on how common this problem is relative to
other legacy encoding phenomena?

> 2. For some of the fallback-encoding-is-wrong bugs still open, a binary
> UTF-8/unibyte heuristic would save the localization from having to choose
> between displaying legacy minority-language content correctly and
> displaying legacy hegemonic-language content correctly. If I understand
> correctly, this is the case at least for Welsh:
> https://bugzilla.mozilla.org/show_bug.cgi?id=844087 .

If we hadn't been defaulting to UTF-8 in any localization at any point, the
minority-language unlabeled UTF-8 legacy would not have had a chance to
develop. It's terrible that after making the initial mistake of letting an
unlabeled non-UTF-8 legacy develop, we repeated the mistake by letting some
localizations allow a legacy of unlabeled UTF-8 to develop. We might still
have a chance of stopping the new legacy of unlabeled UTF-8 from developing.

> 3. Files loaded from local disk have no encoding metadata from the
> transport, and may have no in-band label either; in particular, UTF-8
> plain text with no byte order mark, which is increasingly common, should
> not be misidentified as the legacy encoding.

When accessing the local disk, it might indeed make sense to examine all the
bytes of the file before starting parsing.
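Something along these lines, for the sake of argument. This is only a sketch
under the assumption that the whole file fits in memory; PickEncodingForLocalFile
is a made-up name, it reuses the ClassifyPrefix sketch from earlier in this
message, and it is not what Gecko actually does:

  #include <cstdint>
  #include <fstream>
  #include <iterator>
  #include <string>
  #include <vector>

  // Sketch: for a local file every byte can be inspected before parsing
  // starts, so "is this UTF-8?" gets a definite answer instead of a guess
  // about a prefix.  (Ignores the corner case of a file that ends in the
  // middle of a multi-byte sequence.)
  std::string PickEncodingForLocalFile(const std::string& path,
                                       const std::string& fallbackEncoding) {
    std::ifstream in(path, std::ios::binary);
    std::vector<uint8_t> bytes((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
    // A UTF-8 BOM settles the question outright.
    if (bytes.size() >= 3 &&
        bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
      return "UTF-8";
    }
    // With the whole file in hand, ClassifyPrefix from the sketch above acts
    // as a complete validity check rather than a guess about a prefix.
    return ClassifyPrefix(bytes.data(), bytes.size()) == Utf8Status::ValidSoFar
               ? "UTF-8"
               : fallbackEncoding;
  }

A pure-ASCII file comes out as "UTF-8" here, which is harmless when the
alternative would have been a single-byte legacy fallback.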
> Having a binary UTF-8/unibyte heuristic might address some of the concerns
> mentioned in the "File API should not use 'universal' character detection"
> bug, https://bugzilla.mozilla.org/show_bug.cgi?id=848842 .

I think in the case of the File API, we should just implement what the spec
says and assume UTF-8. I think it's reprehensible that we have pulled
non-spec magic out of thin air here.

> If people are concerned about "infecting" the modern platform with
> heuristics, perhaps we could limit application of the heuristic to quirks
> mode, for HTML delivered over HTTP.

I'm not particularly happy about the prospect of having to change the order
of the quirkiness determination and the encoding determination.

On Fri, Aug 30, 2013 at 11:40 AM, Gervase Markham <g...@mozilla.org> wrote:
> That seems wise to me, on gut instinct.

It looks to me that it was gut instinct that led to stuff like the Esperanto
locale setting the fallback to UTF-8, thereby making that locale top the
list of character encoding override usage frequency. Gut says "UTF-8 is
good, ergo default to UTF-8". I think the latter doesn't follow from the
former on the consumption side when legacy exists. I think it does follow on
the content generation side. But are we doing it on the content generation
side? Of course not!
https://bugzilla.mozilla.org/show_bug.cgi?id=862292

> If the web is moving to UTF-8, and we are trying to encourage that,

I think we should encourage Web authors to use UTF-8 *and* to *declare* it.

> We don't want people to try and move to UTF-8, but move back because they
> haven't figured out how (or are technically unable) to label it correctly
> and "it comes out all wrong".

Trying to make UTF-8 work magically would mean that people think their stuff
works if the heuristic happens to guess right at testing time, but things
could still break at deployment time when the content changes. (Where
"breaks" includes: "Page starts loading as non-UTF-8 and gets reloaded as
UTF-8 in mid-parse, flashing layout and re-running the side effects of
scripts.")

Besides, having to declare "this is not legacy content and its author has a
clue about the current state of how things work" is the way the Web works
and makes progress. Having to declare UTF-8 is not substantially different
from having to declare the standards mode or having to declare a
mobile-friendly viewport. The current non-legacy clue indicator is:

<!DOCTYPE html><meta charset=utf-8><meta
content="width=device-width, initial-scale=1" name="viewport">

(Instead of the <meta charset=utf-8> bit, you could alternatively put the
BOM before the <!DOCTYPE html> bit.) The boilerplate is getting longer, but
it's fundamental to the nature of the legacy problem that legacy content
doesn't declare that it's legacy, so non-legacy content has to declare that
it's not legacy.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform