This seems reasonable, but 50 MB is a pretty large number. Given how unlikely UTF-8 detection is to go wrong, I would have thought this limit could be much lower. What is the number in Chrome?
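A rough illustration of that intuition, as a minimal Rust sketch with a made-up windows-1252 sample: for legacy-encoded text, the first UTF-8 error typically shows up at the very first non-ASCII byte, so even a much smaller window would catch the vast majority of non-UTF-8 files.

    fn main() {
        // "ä" in windows-1252 is the single byte 0xE4; in UTF-8 it would be 0xC3 0xA4.
        // The Finnish sample text here is hypothetical.
        let windows_1252: &[u8] = b"Tiedoston nime\xE4 ei ole ilmoitettu.";
        match std::str::from_utf8(windows_1252) {
            Ok(_) => println!("looks like valid UTF-8"),
            Err(e) => println!(
                "UTF-8 error after {} valid bytes (error length {:?})",
                e.valid_up_to(),
                e.error_len()
            ),
        }
    }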
I assume that other local sources, like chrome: URLs, are expected to be annotated properly.

On Mon, Dec 10, 2018 at 11:28 PM Henri Sivonen <hsivo...@mozilla.com> wrote:
>
> (Note: This isn't really a Web-exposed feature, but this is a Web developer-exposed feature.)
>
> # Summary
>
> Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).
>
> Some Web developers like to develop locally from file: URLs (as opposed to a local HTTP server) and then deploy using a Web server that declares charset=UTF-8. To get the same convenience as when developing with Chrome, they want the files loaded from file: URLs to be treated as UTF-8 even though the HTTP header isn't there.
>
> Non-developer users save files from the Web verbatim, without the HTTP headers, and open the files from file: URLs. These days, those files are most often in UTF-8 and lack the BOM, sometimes they lack <meta charset=utf-8>, and plain text files can't even use <meta charset=utf-8>. These users, too, would like Chrome-like convenience when opening these files from file: URLs in Firefox.
>
> # Details
>
> If an HTML or plain text file loaded from a file: URL does not contain a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely improbable for text intended to be in a non-UTF-8 encoding to look like valid UTF-8 on the byte level.) Otherwise, behave as at present: assume the fallback legacy encoding, whose default depends on the Firefox UI locale.
>
> The 50 MB limit exists to avoid buffering everything when loading a log file whose size is on the order of a gigabyte. 50 MB is an arbitrary size that is significantly larger than "normal" HTML or text files, so that "normal"-sized files are examined with 100% confidence (i.e. the whole file is examined) but can be assumed to fit in RAM even on computers that only have a couple of gigabytes of RAM.
>
> The limit, despite being arbitrary, is checked exactly to avoid visible behavior changes depending on how Necko chooses buffer boundaries.
>
> The limit is a number of bytes instead of a timeout in order to avoid reintroducing timing dependencies (carefully removed in Firefox 4) to HTML parsing--even for file: URLs.
>
> Unless a <meta> declaring the encoding (or a BOM) is found within the first 1024 bytes, up to 50 MB of input is buffered before tokenization starts. That is, the feature assumes that local files don't need incremental HTML parsing, that local file streams don't stall as part of their intended operation, and that the content of local files is available in its entirety (approximately) immediately.
>
> There are counterexamples, like Unix FIFOs (which can be infinite and can stall for an arbitrary amount of time) or file server shares mounted as if they were local disks (data available somewhat less immediately). It is assumed that it's OK to require people who have built workflows around Unix FIFOs to use <meta charset>, and that it's OK to potentially start rendering a little later when file: URLs actually cause network access.
>
> UTF-8 autodetection is given lower precedence than all other signals that are presently considered for file: URLs. In particular, if a file:-URL HTML document frames another file:-URL HTML document (i.e. they count as same-origin), the child inherits the encoding from the parent instead of UTF-8 autodetection getting applied in the child frame.
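A minimal sketch of the decision described in the quoted # Details section, using only Rust's standard library. The constant value, function name, and structure are illustrative, not Gecko's actual code, and the <meta> prescan and the same-origin inheritance that take precedence are glossed over.

    // Assumed exact value of the 50 MB cap; the real constant is whatever the patch uses.
    const DETECTION_LIMIT: usize = 50 * 1024 * 1024;

    /// `buffered` holds everything received so far, capped at the detection limit
    /// (or the whole file if it is smaller); `fallback` is the locale-dependent
    /// legacy encoding label.
    fn choose_encoding(buffered: &[u8], fallback: &'static str) -> &'static str {
        // A UTF-8 BOM wins outright (UTF-16 BOM handling elided).
        if buffered.starts_with(&[0xEF, 0xBB, 0xBF]) {
            return "UTF-8";
        }
        // Only bytes up to the limit are examined; an error past the limit is never seen.
        let examined = &buffered[..buffered.len().min(DETECTION_LIMIT)];
        match std::str::from_utf8(examined) {
            // No UTF-8 error in the examined prefix: assume UTF-8.
            Ok(_) => "UTF-8",
            // A multi-byte sequence split exactly at the cap is not counted as an
            // error here (assumption; the real implementation may differ).
            Err(e) if e.error_len().is_none() && examined.len() == DETECTION_LIMIT => "UTF-8",
            // Otherwise behave as today: the locale-dependent legacy fallback.
            Err(_) => fallback,
        }
    }

With this shape, a "normal"-sized file is examined in its entirety, and only something like a multi-gigabyte log file ever hits the cap.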
>
> # Why file: URLs only
>
> The reason why the feature does not apply to http: or https: resources is that in those cases it really isn't OK to assume that all bytes arrive so quickly as to not benefit from incremental rendering, and it isn't OK to assume that the stream doesn't intentionally stall.
>
> Applying detection to http: or https: resources would mean at least one of the following compromises:
>
> * Making the detection unreliable by making it depend on non-ASCII appearing in the first 1024 bytes (the number of bytes currently buffered for scanning <meta>). If the <title> was always near the start of the file and the natural language used a non-Latin script to make non-ASCII in the <title> a certainty, this solution would be reliable. However, this solution would be particularly bad for Latin-script languages with infrequent non-ASCII, such as Finnish or German, which can legitimately have all-ASCII titles despite the language as a whole including non-ASCII. That is, if a developer tested a site with a title that has some non-ASCII, things would appear to work, but then the site would break when an all-ASCII title occurs.
>
> * Making results depend on timing. (Having a detection timeout would make the results depend on network performance relative to wall-clock time.)
>
> * Making the detection unreliable by examining only the first buffer passed by the networking subsystem to the HTML parser. This makes the result dependent on network buffer boundaries (*and* potentially timing to the extent timing affects the boundaries), which is unreliable. Prior to Firefox 4, HTML parsing in Firefox depended on network buffer boundaries, which was bad and was remedied in Firefox 4. According to https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 , Chrome chooses this mode of badness.
>
> * Breaking incremental rendering. (Not acceptable for remote content for user-perceived performance reasons.) This is what the solution for file: URLs does, on the assumption that it's OK because the data in its entirety is (approximately) immediately available.
>
> * Causing reloads. This is the mode of badness that applies when our Japanese detector is in use and the first 1024 bytes aren't enough to make the decision.
>
> All of these are bad. It's better to make the failure to declare UTF-8 in the http/https case something that the Web developer obviously has to fix (by adding <meta>, an HTTP header, or the BOM) than to make it appear that things work when actually at least one of the above forms of badness applies.
>
> # Bug
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=1071816
>
> # Link to standard
>
> https://html.spec.whatwg.org/#determining-the-character-encoding step 7 is basically an "anything goes" step for legacy reasons--mainly to allow the Japanese encoding detection that IE, WebKit and Gecko had before the spec was written. Chrome started detecting more without prior standard-setting discussion. See https://github.com/whatwg/encoding/issues/68 for after-the-fact discussion.
>
> # Platform coverage
>
> All
>
> # Estimated or target release
>
> 66
>
> # Preference behind which this will be implemented
>
> Not planning to have a pref for this.
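For the first compromise in the quoted list (detection limited to the 1024-byte <meta> prescan window), a small sketch of how a Latin-script page can give no signal within that window. The page contents are hypothetical.

    fn main() {
        let mut page: Vec<u8> = Vec::new();
        // All-ASCII title, as is plausible for a German or Finnish page.
        page.extend_from_slice(b"<!DOCTYPE html><title>Impressum</title>");
        // ASCII filler that pushes any non-ASCII past the 1024-byte prescan window.
        page.extend_from_slice(&[b' '; 1024]);
        // windows-1252 sharp s (0xDF), which is not valid UTF-8 in this position.
        page.extend_from_slice(b"Eine lange Stra\xDFe.");

        let prescan = &page[..1024];
        // true: the prescan window is all ASCII and proves nothing either way.
        println!("first 1024 bytes valid as UTF-8: {}", std::str::from_utf8(prescan).is_ok());
        // false: the legacy-encoded byte only appears later in the document.
        println!("whole page valid as UTF-8: {}", std::str::from_utf8(&page).is_ok());
    }

A detector that only sees the prescan window would have to guess here, which is exactly the unreliability described in that bullet.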
>
> # Is this feature enabled by default in sandboxed iframes?
>
> This is implemented to apply to all non-resource:-URL-derived file: URLs, but since same-origin inheritance to child frames takes precedence, this isn't expected to apply to sandboxed iframes in practice.
>
> # DevTools bug
>
> No new dev tools integration. The pre-existing console warning about an undeclared character encoding will still be shown in the autodetection case.
>
> # Do other browser engines implement this
>
> Chrome does, but not with the same number of bytes examined.
>
> Safari as of El Capitan (my Mac is stuck on El Capitan) doesn't.
>
> Edge as of Windows 10 1803 doesn't.
>
> # web-platform-tests
>
> As far as I'm aware, WPT doesn't cover file: URL behavior, and there isn't a proper spec for this. Hence, unit tests use mochitest-chrome.
>
> # Is this feature restricted to secure contexts?
>
> Restricted to file: URLs.
>
> --
> Henri Sivonen
> hsivo...@mozilla.com