This seems reasonable, but 50 MB is a pretty large number.  Given the
odds of UTF-8 detection failing, I would have thought that this could
be much lower.  What is the number in Chrome?

I assume that other local sources like chrome: are expected to be
annotated properly.
On Mon, Dec 10, 2018 at 11:28 PM Henri Sivonen <hsivo...@mozilla.com> wrote:
>
> (Note: This isn't really a Web-exposed feature, but this is a Web
> developer-exposed feature.)
>
> # Summary
>
> Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).
>
> Some Web developers like to develop locally from file: URLs (as
> opposed to a local HTTP server) and then deploy using a Web server that
> declares charset=UTF-8. To get the same convenience as when developing
> with Chrome, they want the files loaded from file: URLs to be treated
> as UTF-8 even though the HTTP header isn't there.
>
> Non-developer users save files from the Web verbatim without the HTTP
> headers and open the files from file: URLs. These days, those files
> are most often in UTF-8 and lack the BOM, and sometimes they lack
> <meta charset=utf-8>, and plain text files can't even use <meta
> charset=utf-8>. These users, too, would like a Chrome-like convenience
> when opening these files from file: URLs in Firefox.
>
> # Details
>
> If an HTML or plain text file loaded from a file: URL does not contain
> a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely
> improbable for text intended to be in a non-UTF-8 encoding to look
> like valid UTF-8 on the byte level.) Otherwise, behave as at
> present: assume the fallback legacy encoding, whose default depends on
> the Firefox UI locale.
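>
> To illustrate the decision (a rough sketch only; Gecko actually uses
> encoding_rs rather than a plain std validation pass, and the function
> name and constant here are invented for illustration):
>
>   const DETECTION_LIMIT: usize = 50 * 1024 * 1024; // 50 MB
>
>   // Returns Some("UTF-8") if the examined prefix is valid UTF-8, or
>   // None if the caller should fall back to the locale-dependent
>   // legacy encoding.
>   fn detect_utf8(bytes: &[u8]) -> Option<&'static str> {
>       let prefix = &bytes[..bytes.len().min(DETECTION_LIMIT)];
>       match std::str::from_utf8(prefix) {
>           Ok(_) => Some("UTF-8"),
>           // error_len() == None means the prefix merely got cut off in
>           // the middle of a multi-byte sequence at the 50 MB boundary,
>           // which shouldn't count as a UTF-8 error for a larger file.
>           Err(e) if e.error_len().is_none()
>               && prefix.len() == DETECTION_LIMIT => Some("UTF-8"),
>           Err(_) => None,
>       }
>   }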
>
> The 50 MB limit exists to avoid buffering everything when loading a
> log file whose size is on the order of a gigabyte. 50 MB is an
> arbitrary size that is significantly larger than "normal" HTML or text
> files, so that "normal"-sized files are examined with 100% confidence
> (i.e. the whole file is examined) but can be assumed to fit in RAM
> even on computers that only have a couple of gigabytes of RAM.
>
> The limit, despite being arbitrary, is checked exactly to avoid
> visible behavior changes depending on how Necko chooses buffer
> boundaries.
>
> The limit is a number of bytes instead of a timeout in order to avoid
> reintroducing timing dependencies (carefully removed in Firefox 4) to
> HTML parsing--even for file: URLs.
>
> Unless a <meta> declaring the encoding (or a BOM) is found within the
> first 1024 bytes, up to 50 MB of input is buffered before tokenization
> starts. That is, the feature assumes that local files don't need
> incremental HTML parsing, that local file streams don't stall as part
> of their intended operation, and that the content of local files is
> available in its entirety (approximately) immediately.
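>
> In sketch form (hypothetical names; the real prescan and buffering
> logic live in Gecko's HTML parser, and the prescan is reduced here to
> a crude substring search):
>
>   enum Buffering {
>       Incremental,       // encoding known up front; tokenize as data arrives
>       BufferThenDetect,  // buffer up to 50 MB, detect UTF-8, then tokenize
>   }
>
>   fn has_bom(bytes: &[u8]) -> bool {
>       bytes.starts_with(&[0xEF, 0xBB, 0xBF])  // UTF-8
>           || bytes.starts_with(&[0xFE, 0xFF]) // UTF-16BE
>           || bytes.starts_with(&[0xFF, 0xFE]) // UTF-16LE
>   }
>
>   // Crude stand-in for the spec's <meta> prescan of the first 1024 bytes.
>   fn prescan_finds_charset(first_1024: &[u8]) -> bool {
>       first_1024
>           .windows(b"charset".len())
>           .any(|w| w.eq_ignore_ascii_case(b"charset"))
>   }
>
>   fn decide_buffering(first_1024: &[u8]) -> Buffering {
>       if has_bom(first_1024) || prescan_finds_charset(first_1024) {
>           Buffering::Incremental
>       } else {
>           Buffering::BufferThenDetect
>       }
>   }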
>
> There are counterexamples, such as Unix FIFOs (which can be infinite
> and can stall for an arbitrary amount of time) or file server shares
> mounted as if they were local disks (where data is available somewhat
> less immediately). It is assumed that it's OK to require people who have
> built workflows around Unix FIFOs to use <meta charset> and that it's
> OK to potentially start rendering a little later when file: URLs
> actually cause network access.
>
> UTF-8 autodetection is given lower precedence than all other signals
> that are presently considered for file: URLs. In particular, if a
> file:-URL HTML document frames another file: URL HTML document (i.e.
> they count as same-origin), the child inherits the encoding from the
> parent instead of UTF-8 autodetection getting applied in the child
> frame.
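>
> As an ordering sketch (variant names invented, and signals like a user
> override or prior-visit info omitted; the real precedence is spread
> across the parser and docshell rather than being a single enum):
>
>   // Later variants take precedence over earlier ones; the new UTF-8
>   // detection slots in just above the locale-dependent fallback.
>   #[derive(PartialEq, Eq, PartialOrd, Ord)]
>   enum EncodingSource {
>       LocaleFallback,
>       Utf8Detection,          // new, file: URLs only
>       ParentFrameInheritance, // same-origin file: parent wins over detection
>       MetaPrescan,
>       Bom,
>   }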
>
> # Why file: URLs only
>
> The reason why the feature does not apply to http: or https: resources
> is that in those cases, it really isn't OK to assume that all bytes
> arrive so quickly as to not benefit from incremental rendering and it
> isn't OK to assume that the stream doesn't intentionally stall.
>
> Applying detection to http: or https: resources would mean at least one
> of the following compromises:
>
> * Making the detection unreliable by making it depend on non-ASCII
> appearing in the first 1024 bytes (the number of bytes currently
> buffered for scanning <meta>). If the <title> were always near the
> start of the file and the natural language used a non-Latin script
> (making non-ASCII in the <title> a certainty), this solution would be
> reliable. However, this solution would be particularly bad for
> Latin-script languages with infrequent non-ASCII, such as Finnish or
> German, which can legitimately have all-ASCII titles despite the
> language as a whole including non-ASCII. That is, if a developer
> tested a site with a title that has some non-ASCII, things would
> appear to work, but then the site would break when an all-ASCII title
> occurs.
>
> * Making results depend on timing. (Having a detection timeout would
> make the results depend on network performance relative to wall-clock
> time.)
>
> * Making the detection unreliable by examining only the first buffer
> passed by the networking subsystem to the HTML parser. This makes the
> result dependent on network buffer boundaries (*and* potentially
> timing to the extent timing affects the boundaries), which is
> unreliable. Prior to Firefox 4, HTML parsing in Firefox depended on
> network buffer boundaries, which was bad and was remedied in Firefox
> 4. According to
> https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 ,
> Chrome chooses this mode of badness.
>
> * Breaking incremental rendering. (Not acceptable for remote content
> for user-perceived performance reasons.) This is what the solution for
> file: URLs does on the assumption that it's OK, because the data in
> its entirety is (approximately) immediately available.
>
> * Causing reloads. This is the mode of badness that applies when our
> Japanese detector is in use and the first 1024 bytes aren't enough to
> make the decision.
>
> All of these are bad. It's better to make the failure to declare UTF-8
> in the http/https case something that the Web developer obviously has
> to fix (by adding <meta>, an HTTP header, or the BOM) than to make it
> appear that things work when actually at least one of the above forms
> of badness applies.
>
> # Bug
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=1071816
>
> # Link to standard
>
> https://html.spec.whatwg.org/#determining-the-character-encoding step
> 7 is basically an "anything goes" step for legacy reasons--mainly to
> allow Japanese encoding detection that IE, WebKit and Gecko had before
> the spec was written. Chrome started detecting more without prior
> standard-setting discussion. See
> https://github.com/whatwg/encoding/issues/68 for after-the-fact
> discussion.
>
> # Platform coverage
>
> All
>
> # Estimated or target release
>
> 66
>
> # Preference behind which this will be implemented
>
> Not planning to have a pref for this.
>
> # Is this feature enabled by default in sandboxed iframes?
>
> This is implemented to apply to all non-resource:-URL-derived file:
> URLs, but since same-origin inheritance to child frames takes
> precedence, this isn't expected to apply to sandboxed iframes in
> practice.
>
> # DevTools bug
>
> No new dev tools integration. The pre-existing console warning about
> undeclared character encoding will still be shown in the autodetection
> case.
>
> # Do other browser engines implement this
>
> Chrome does, but not with the same number of bytes examined.
>
> Safari as of El Capitan (my Mac is stuck on El Capitan) doesn't.
>
> Edge as of Windows 10 1803 doesn't.
>
> # web-platform-tests
>
> As far as I'm aware, WPT doesn't cover file: URL behavior, and there
> isn't a proper spec for this. Hence, unit tests use mochitest-chrome.
>
> # Is this feature restricted to secure contexts?
>
> Restricted to file: URLs.
>
> --
> Henri Sivonen
> hsivo...@mozilla.com