On Tue, Dec 11, 2018 at 2:24 AM Martin Thomson <m...@mozilla.com> wrote:
> This seems reasonable, but 50M is a pretty large number.  Given the
> odds of UTF-8 detection failing, I would have thought that this could
> be much lower.

Consider the case of a document of ASCII text with a copyright sign in
the footer. I'd rather not make anyone puzzle over why the behavior of
the footer depends on how much text comes before the footer.
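
To make that concrete, here's a tiny standalone sketch (plain Rust,
nothing to do with the actual Gecko code): a windows-1252 copyright
sign is the lone byte 0xA9, which can never occur by itself in valid
UTF-8, but a detector only notices it if that byte falls within the
examined prefix.

  // Illustration only; the function is made up for this example.
  fn looks_like_utf8(prefix: &[u8]) -> bool {
      match std::str::from_utf8(prefix) {
          Ok(_) => true,
          // A multi-byte sequence cut off at the end of the prefix isn't
          // a hard error for this purpose; an invalid sequence earlier is.
          Err(e) => e.error_len().is_none(),
      }
  }

  fn main() {
      let body = vec![b'a'; 1024]; // stand-in for a long ASCII document
      let footer = b"Copyright \xA9 2018"; // windows-1252 bytes

      let mut doc = body.clone();
      doc.extend_from_slice(footer);

      // Examined prefix covers the whole file: the legacy byte is caught.
      assert!(!looks_like_utf8(&doc));
      // Examined prefix ends before the footer: it "looks like" UTF-8.
      assert!(looks_like_utf8(&doc[..body.len()]));
  }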

50 MB is intentionally extremely large relative to "normal" HTML and
text files so that the limit is reached approximately "never" unless
you open *huge* log files.

The HTML spec is about 11 MB these days, so that's an existence proof
that a non-log-file HTML document can exceed 10 MB. Of course, the
limit doesn't need to be larger than present-day UTF-8 files, only
larger than "normal"-sized *legacy* non-UTF-8 files.

It is quite possible that 50 MB is *too* large considering 32-bit
systems and what *other* allocations are proportional to the buffer
size, and I'm open to changing the limit to something smaller than 50
MB as long as it's still larger than "normal" non-UTF-8 HTML and text
files.

How about I change it to 5 MB on the assumption that that's still very
large relative to pre-UTF-8-era HTML and text file sizes?

> What is the number in Chrome?

It depends. It's unclear to me what exactly it depends on. Based on
https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 ,
I expect it to depend on some combination of file system, OS kernel
and Chromium IO library internals.

On Ubuntu 18.04 with ext4 on an SSD, the number is 64 KB. On Windows
10 1803 with NTFS on an SSD, it's something smaller.

I think making the limit depend on the internals of file IO buffering
instead of on a constant in the HTML parser is a really bad idea.
Also, 64 KB or anything smaller seems way too small for the purpose of
making it so that the user approximately never needs to puzzle over
why things behave differently based on the length of the ASCII prefix
of a file that has non-ASCII later in the file.

> I assume that other local sources like chrome: are expected to be
> annotated properly.

From source inspection, it seems that chrome: URLs already get
hard-coded to UTF-8 on the channel level:
https://searchfox.org/mozilla-central/source/chrome/nsChromeProtocolHandler.cpp#187

While developing the patch, the only URLs I saw showing up to the HTML
parser as file: URLs without really being file: URLs were resource:
URLs, so only resource: URLs got a special check that fast-tracks them
to UTF-8 instead of buffering for detection like normal file: URLs.
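
For illustration, the shape of that special-casing is roughly the
following (a sketch with invented names in Rust; the actual check
lives in the C++ parser front end and is not spelled like this):

  // Sketch of the decision described above; names are invented.
  enum FileLikeSource {
      ResourceDerived, // resource: URL reaching the parser as file:
      PlainFile,       // an ordinary file: URL
  }

  enum Plan {
      AssumeUtf8Immediately,                  // fast-track, no buffering
      BufferAndDetect { limit_bytes: usize }, // buffer, then validate
  }

  fn plan_for(source: FileLikeSource) -> Plan {
      match source {
          FileLikeSource::ResourceDerived => Plan::AssumeUtf8Immediately,
          FileLikeSource::PlainFile => Plan::BufferAndDetect {
              limit_bytes: 50 * 1024 * 1024,
          },
      }
  }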

> On Mon, Dec 10, 2018 at 11:28 PM Henri Sivonen <hsivo...@mozilla.com> wrote:
> >
> > (Note: This isn't really a Web-exposed feature, but this is a Web
> > developer-exposed feature.)
> >
> > # Summary
> >
> > Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).
> >
> > Some Web developers like to develop locally from file: URLs (as
> > opposed to a local HTTP server) and then deploy using a Web server
> > that declares charset=UTF-8. To get the same convenience as when
> > developing with Chrome, they want files loaded from file: URLs to be
> > treated as UTF-8 even though the HTTP header isn't there.
> >
> > Non-developer users save files from the Web verbatim without the HTTP
> > headers and open the files from file: URLs. These days, those files
> > are most often in UTF-8 and lack the BOM, and sometimes they lack
> > <meta charset=utf-8>, and plain text files can't even use <meta
> > charset=utf-8>. These users, too, would like a Chrome-like convenience
> > when opening these files from file: URLs in Firefox.
> >
> > # Details
> >
> > If an HTML or plain text file loaded from a file: URL does not contain
> > a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely
> > improbable for text intended to be in a non-UTF-8 encoding to look
> > like valid UTF-8 on the byte level.) Otherwise, behave like at
> > present: assume the fallback legacy encoding, whose default depends on
> > the Firefox UI locale.
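> >
> > As a rough sketch (hypothetical Rust with invented names, not the
> > actual implementation), the decision boils down to validating the
> > buffered prefix:
> >
> >   // Sketch only; the constant and names are invented, and how a
> >   // multi-byte sequence truncated exactly at the limit is treated
> >   // may differ in the actual implementation.
> >   const DETECTION_LIMIT: usize = 50 * 1024 * 1024;
> >
> >   fn detect(buffered: &[u8], locale_fallback: &'static str) -> &'static str {
> >       let prefix = &buffered[..buffered.len().min(DETECTION_LIMIT)];
> >       match std::str::from_utf8(prefix) {
> >           // No UTF-8 error in the examined prefix: assume UTF-8.
> >           Ok(_) => "UTF-8",
> >           // Only truncated at the very end of the prefix: still
> >           // treated here as looking like UTF-8.
> >           Err(e) if e.error_len().is_none() => "UTF-8",
> >           // A hard UTF-8 error: behave as before and use the legacy
> >           // fallback, whose default depends on the Firefox UI locale.
> >           Err(_) => locale_fallback,
> >       }
> >   }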
> >
> > The 50 MB limit exists to avoid buffering everything when loading a
> > log file whose size is on the order of a gigabyte. 50 MB is an
> > arbitrary size that is significantly larger than "normal" HTML or text
> > files, so that "normal"-sized files are examined with 100% confidence
> > (i.e. the whole file is examined) but can be assumed to fit in RAM
> > even on computers that only have a couple of gigabytes of RAM.
> >
> > The limit, despite being arbitrary, is checked exactly to avoid
> > visible behavior changes depending on how Necko chooses buffer
> > boundaries.
> >
> > The limit is a number of bytes instead of a timeout in order to avoid
> > reintroducing timing dependencies (carefully removed in Firefox 4) to
> > HTML parsing--even for file: URLs.
> >
> > Unless a <meta> declaring the encoding (or a BOM) is found within the
> > first 1024 bytes, up to 50 MB of input is buffered before tokenization
> > starts. That is, the feature assumes that local files don't need
> > incremental HTML parsing, that local file streams don't stall as part
> > of their intended operation, and that the content of local files is
> > available in its entirety (approximately) immediately.
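> >
> > In sketch form (invented names; not the actual parser code), the
> > buffering policy described above is approximately:
> >
> >   // Sketch of the buffering policy; names invented for illustration.
> >   enum Action {
> >       StartTokenizingNow,       // encoding already known from BOM/<meta>
> >       KeepBuffering,            // wait for more data (or EOF)
> >       DecideFromBufferAndStart, // run UTF-8 detection on the buffer
> >   }
> >
> >   fn on_data(
> >       buffered_len: usize,
> >       bom_or_meta_in_first_1024: bool,
> >       eof: bool,
> >   ) -> Action {
> >       const LIMIT: usize = 50 * 1024 * 1024;
> >       if bom_or_meta_in_first_1024 {
> >           // A BOM or <meta> within the first 1024 bytes settles the
> >           // encoding, so there is no need to buffer further.
> >           Action::StartTokenizingNow
> >       } else if eof || buffered_len >= LIMIT {
> >           // The whole file (or the first 50 MB of it) is available:
> >           // run detection and start tokenizing.
> >           Action::DecideFromBufferAndStart
> >       } else {
> >           Action::KeepBuffering
> >       }
> >   }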
> >
> > There are counterexamples like Unix FIFOs (can be infinite and can
> > stall for an arbitrary amount of time) or file server shares mounted
> > as if they were local disks (data available somewhat less
> > immediately). It is assumed that it's OK to require people who have
> > built workflows around Unix FIFOs to use <meta charset> and that it's
> > OK to potentially start rendering a little later when file: URLs
> > actually cause network access.
> >
> > UTF-8 autodetection is given lower precedence than all other signals
> > that are presently considered for file: URLs. In particular, if a
> > file:-URL HTML document frames another file: URL HTML document (i.e.
> > they count as same-origin), the child inherits the encoding from the
> > parent instead of UTF-8 autodetection getting applied in the child
> > frame.
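> >
> > Schematically (a sketch that glosses over the exact set and relative
> > order of the pre-existing signals):
> >
> >   // Sketch only: detection is consulted after every pre-existing signal.
> >   fn choose_encoding(
> >       pre_existing_signal: Option<&'static str>, // BOM, <meta>, parent, ...
> >       utf8_detected: bool, // no UTF-8 error in the buffered prefix
> >       locale_fallback: &'static str,
> >   ) -> &'static str {
> >       pre_existing_signal
> >           .unwrap_or(if utf8_detected { "UTF-8" } else { locale_fallback })
> >   }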
> >
> > # Why file: URLs only
> >
> > The reason why the feature does not apply to http: or https: resources
> > is that in those cases, it really isn't OK to assume that all bytes
> > arrive so quickly as to not benefit from incremental rendering, and it
> > isn't OK to assume that the stream doesn't intentionally stall.
> >
> > Applying detection to http: or https: resources would mean at least
> > one of the following compromises:
> >
> > * Making the detection unreliable by making it depend on non-ASCII
> > appearing in the first 1024 bytes (the number of bytes currently
> > buffered for scanning <meta>). If the <title> were always near the
> > start of the file and the natural language used a non-Latin script to
> > make non-ASCII in the <title> a certainty, this solution would be
> > reliable. However, this solution would be particularly bad for
> > Latin-script languages with infrequent non-ASCII, such as Finnish or
> > German, which can legitimately have all-ASCII titles despite the
> > language as a whole including non-ASCII. That is, if a developer
> > tested a site with a title that has some non-ASCII, things would
> > appear to work, but then the site would break when an all-ASCII title
> > occurs.
> >
> > * Making results depend on timing. (Having a detection timeout would
> > make the results depend on network performance relative to wall-clock
> > time.)
> >
> > * Making the detection unreliable by examining only the first buffer
> > passed by the networking subsystem to the HTML parser. This makes the
> > result dependent on network buffer boundaries (*and* potentially
> > timing to the extent timing affects the boundaries), which is
> > unreliable. Prior to Firefox 4, HTML parsing in Firefox depended on
> > network buffer boundaries, which was bad and was remedied in Firefox
> > 4. According to
> > https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 ,
> > Chrome chooses this mode of badness.
> >
> > * Breaking incremental rendering. (Not acceptable for remote content
> > for user-perceived performance reasons.) This is what the solution for
> > file: URLs does on the assumption that it's OK, because the data in
> > its entirety is (approximately) immediately available.
> >
> > * Causing reloads. This is the mode of badness that applies when our
> > Japanese detector is in use and the first 1024 bytes aren't enough
> > to make the decision.
> >
> > All of these are bad. It's better to make the failure to declare UTF-8
> > in the http/https case something that the Web developer obviously has
> > to fix (by adding <meta>, an HTTP header, or the BOM) than to make it
> > appear that things work when actually at least one of the above forms
> > of badness applies.
> >
> > # Bug
> >
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1071816
> >
> > # Link to standard
> >
> > https://html.spec.whatwg.org/#determining-the-character-encoding step
> > 7 is basically an "anything goes" step for legacy reasons--mainly to
> > allow Japanese encoding detection that IE, WebKit and Gecko had before
> > the spec was written. Chrome started detecting more without prior
> > standard-setting discussion. See
> > https://github.com/whatwg/encoding/issues/68 for after-the-fact
> > discussion.
> >
> > # Platform coverage
> >
> > All
> >
> > # Estimated or target release
> >
> > 66
> >
> > # Preference behind which this will be implemented
> >
> > Not planning to have a pref for this.
> >
> > # Is this feature enabled by default in sandboxed iframes?
> >
> > This is implemented to apply to all non-resource:-URL-derived file:
> > URLs, but since same-origin inheritance to child frames takes
> > precedence, this isn't expected to apply to sandboxed iframes in
> > practice.
> >
> > # DevTools bug
> >
> > No new dev tools integration. The pre-existing console warning about
> > an undeclared character encoding will still be shown in the
> > autodetection case.
> >
> > # Do other browser engines implement this
> >
> > Chrome does, but not with the same number of bytes examined.
> >
> > Safari as of El Capitan (my Mac is stuck on El Capitan) doesn't.
> >
> > Edge as of Windows 10 1803 doesn't.
> >
> > # web-platform-tests
> >
> > As far as I'm aware, WPT doesn't cover file: URL behavior, and there
> > isn't a proper spec for this. Hence, unit tests use mochitest-chrome.
> >
> > # Is this feature restricted to secure contexts?
> >
> > Restricted to file: URLs.
> >
> > --
> > Henri Sivonen
> > hsivo...@mozilla.com



-- 
Henri Sivonen
hsivo...@mozilla.com
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
