On Thu, Aug 29, 2013 at 9:41 PM, Zack Weinberg <za...@panix.com> wrote:
> All the discussion of fallback character encodings has reminded me of an
> issue I've been meaning to bring up for some time: As a user of the en-US
> localization, nowadays the overwhelmingly most common situation where I see
> mojibake is when a site puts UTF-8 in its pages without declaring any
> encoding at all (neither via <meta charset> nor Content-Type).

Telemetry data suggests that these days the more common reason for
seeing mojibake is that there is an encoding declaration but it is
wrong.  My guess is that this arises from Linux distributions silently
changing their Apache defaults to send a charset parameter in
Content-Type, on the theory that it's good for security to send one
even though the person packaging Apache logically can have no clue
what the value of the parameter should be for a specific deployment.
(I think we should not start second-guessing encoding declarations.)
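
For concreteness, the kind of distribution default in question is a
single directive in the shipped httpd.conf, along these lines (the
exact value varies by distribution; this one is just an illustration):

    AddDefaultCharset UTF-8

That directive makes Apache append a charset parameter to every
text/plain and text/html response, regardless of what encoding the
files being served actually use.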

> It is
> possible to distinguish UTF-8 from most legacy encodings heuristically with
> high reliability, and I'd like to suggest that we ought to do so,
> independent of locale.

This is true if you run the heuristic over the entire byte stream.
Unfortunately, since we support incremental loading of HTML (and will
have to continue to do so), we don't have the entire byte stream
available at the time when we need to decide what encoding to assume.
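
To make the heuristic concrete, here is a minimal Python sketch of the
kind of whole-buffer check being discussed (the function name and the
ASCII-only shortcut are my own illustration, not anything Gecko
actually implements):

    def looks_like_utf8(data):
        """Whole-buffer heuristic: valid UTF-8 containing at least one
        multi-byte sequence. Pure ASCII decodes identically in most
        legacy encodings, so it carries no signal either way."""
        if not any(b >= 0x80 for b in data):
            return False  # ASCII only: no evidence of UTF-8
        try:
            data.decode("utf-8")  # strict decoding by default
        except UnicodeDecodeError:
            return False  # invalid UTF-8: almost certainly legacy
        return True

The catch is exactly the one described above: an incremental parser
only ever sees a prefix of the stream when the decision has to be
made, and a prefix can end in the middle of a multi-byte sequence or
contain no non-ASCII bytes at all, so the whole-buffer answer isn't
available in time.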

> Having read through a bunch of the "fallback encoding is wrong" bugs Henri's
> been filing, I have the impression that Henri would prefer we *not* detect
> UTF-8

Correct. Every time a localization sets the fallback to UTF-8, or a
heuristic detector detects unlabeled UTF-8, Web authors get an
opportunity to generate a new legacy of unlabeled UTF-8 content while
thinking that everything is okay.

> 1. There exist sites that still regularly add new, UTF-8-encoded content,
> but whose *structure* was laid down in the late 1990s or early 2000s,
> declares no encoding, and is unlikely ever to be updated again. The example
> I have to hand is
> http://www.eyrie-productions.com/Forum/dcboard.cgi?az=read_count&om=138&forum=DCForumID24&viewmode=threaded
> ; many other posts on this forum have the same problem. Take note of the
> vintage HTML. I suggested to the admins of this site that they add <meta
> charset="utf-8"> to the master page template, and was told that no one
> involved in current day-to-day operations has the necessary access
> privileges. I suspect that this kind of situation is rather more common than
> we would like to believe.

It's easy to come up with a single anecdotal data point of something
on the Web being broken.  Is there any data on how common this problem
is relative to other legacy-encoding phenomena?

> 2. For some of the fallback-encoding-is-wrong bugs still open, a binary
> UTF-8/unibyte heuristic would save the localization from having to choose
> between displaying legacy minority-language content correctly and displaying
> legacy hegemonic-language content correctly. If I understand correctly, this
> is the case at least for Welsh:
> https://bugzilla.mozilla.org/show_bug.cgi?id=844087 .

If we hadn't been defaulting to UTF-8 in any localization at any
point, the minority-language unlabeled UTF-8 legacy would not have had
a chance to develop. It's terrible that, after the initial mistake of
letting an unlabeled non-UTF-8 legacy develop, the mistake has been
repeated in some localizations by allowing a legacy of unlabeled UTF-8
to develop.  We might still have a chance of stopping the new legacy
of unlabeled UTF-8 from developing.

> 3. Files loaded from local disk have no encoding metadata from the
> transport, and may have no in-band label either; in particular, UTF-8 plain
> text with no byte order mark, which is increasingly common, should not be
> misidentified as the legacy encoding.

When accessing the local disk, it might indeed make sense to examine
all the bytes of the file before starting parsing.
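
As a sketch of what that could look like (the function name and the
fallback label here are illustrative, not what Gecko actually does):

    import codecs

    def sniff_local_file_encoding(path, fallback="windows-1252"):
        # A local file can be read in full before parsing starts,
        # so the whole-buffer check is available up front.
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8"  # explicit in-band label
        if any(b >= 0x80 for b in data):
            try:
                data.decode("utf-8")
                return "utf-8"  # non-ASCII and valid UTF-8 throughout
            except UnicodeDecodeError:
                pass
        return fallback  # ASCII-only or invalid UTF-8: use the fallback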

> Having a binary UTF-8/unibyte
> heuristic might address some of the concerns mentioned in the "File API
> should not use 'universal' character detection" bug,
> https://bugzilla.mozilla.org/show_bug.cgi?id=848842 .

I think in the case of the File API, we should just implement what the
spec says and assume UTF-8. I think it's reprehensible that we have
pulled non-spec magic out of thin air here.

> If people are concerned about "infecting" the modern platform with
> heuristics, perhaps we could limit application of the heuristic to quirks
> mode, for HTML delivered over HTTP.

I'm not particularly happy about the prospect of having to change the
order of the quirkiness determination and the encoding determination.

On Fri, Aug 30, 2013 at 11:40 AM, Gervase Markham <g...@mozilla.org> wrote:
> That seems wise to me, on gut instinct.

It looks to me like it was gut instinct that led to things like the
Esperanto locale setting its fallback to UTF-8, thereby making that
locale top the list of character encoding override usage frequency.

Gut says "UTF-8 is good, ergo default to UTF-8". I think the latter
doesn't follow from the former on the consumption side when legacy
exists. I think it does follow on the content generation side. But are
we doing it on the content generation side? Of course not!
https://bugzilla.mozilla.org/show_bug.cgi?id=862292

> If the web is moving to UTF-8,
> and we are trying to encourage that,

I think we should encourage Web authors to use UTF-8 *and* to *declare* it.

> We don't want people to try and move to UTF-8, but move back because
> they haven't figured out how (or are technically unable) to label it
> correctly and "it comes out all wrong".

Trying to make UTF-8 work magically would mean that people think
their stuff works if the heuristic happens to guess right at testing
time, even though things could still break at deployment time when the
content changes. (Where "breaks" includes: "The page starts loading as
non-UTF-8 and gets reloaded as UTF-8 mid-parse, flashing the layout
and re-running the side effects of scripts.")

Besides, having to declare "this is not legacy content and its author
has a clue about the current state of how things work" is the way the
Web works and makes progress. Having to declare UTF-8 is not
substantially different from having to declare the standards mode or
having to declare a mobile-friendly viewport.

The current non-legacy clue indicator is:
<!DOCTYPE html><meta charset=utf-8><meta content="width=device-width,
initial-scale=1" name="viewport">

(Instead of the <meta charset=utf-8> bit, you could alternatively put
the BOM before the <!DOCTYPE html> bit.)

The boilerplate is getting longer, but it's fundamental to the nature
of the legacy problem that legacy content doesn't declare that it's
legacy, so non-legacy content has to declare that it's not legacy.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/