(Note: This isn't really a Web-exposed feature, but it is a Web developer-exposed feature.)
# Summary

Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).

Some Web developers like to develop locally from file: URLs (as opposed to a local HTTP server) and then deploy using a Web server that declares charset=UTF-8. To get the same convenience as when developing with Chrome, they want files loaded from file: URLs to be treated as UTF-8 even though the HTTP header isn't there.

Non-developer users save files from the Web verbatim, without the HTTP headers, and open the files from file: URLs. These days, those files are most often in UTF-8 and lack the BOM, sometimes they also lack <meta charset=utf-8>, and plain text files can't even use <meta charset=utf-8>. These users, too, would like a Chrome-like convenience when opening these files from file: URLs in Firefox.

# Details

If an HTML or plain text file loaded from a file: URL does not contain a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely improbable for text intended to be in a non-UTF-8 encoding to look like valid UTF-8 on the byte level.) Otherwise, behave as at present: assume the fallback legacy encoding, whose default depends on the Firefox UI locale.

The 50 MB limit exists to avoid buffering everything when loading a log file whose size is on the order of a gigabyte. 50 MB is an arbitrary size that is significantly larger than "normal" HTML or text files, so that "normal"-sized files are examined with 100% confidence (i.e. the whole file is examined) but can be assumed to fit in RAM even on computers that only have a couple of gigabytes of RAM. The limit, despite being arbitrary, is checked exactly to avoid visible behavior changes depending on how Necko chooses buffer boundaries. The limit is a number of bytes instead of a timeout in order to avoid reintroducing timing dependencies (carefully removed in Firefox 4) to HTML parsing--even for file: URLs.

Unless a <meta> declaring the encoding (or a BOM) is found within the first 1024 bytes, up to 50 MB of input is buffered before tokenization starts. That is, the feature assumes that local files don't need incremental HTML parsing, that local file streams don't stall as part of their intended operation, and that the content of local files is available in its entirety (approximately) immediately. There are counterexamples, such as Unix FIFOs (which can be infinite and can stall for an arbitrary amount of time) or file server shares mounted as if they were local disks (whose data is available somewhat less immediately). It is assumed that it's OK to require people who have built workflows around Unix FIFOs to use <meta charset>, and that it's OK to potentially start rendering a little later when file: URLs actually cause network access.

UTF-8 autodetection is given lower precedence than all other signals that are presently considered for file: URLs. In particular, if a file:-URL HTML document frames another file:-URL HTML document (i.e. they count as same-origin), the child inherits the encoding from the parent instead of UTF-8 autodetection getting applied in the child frame.
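As a rough illustration, the decision described in this section amounts to something like the following Rust sketch. This is not Gecko's actual code; the function and parameter names are made up for illustration, and how a multi-byte sequence cut off exactly at the 50 MB boundary is classified is an assumption of the sketch rather than something spelled out above.

```rust
/// Hypothetical sketch (not Gecko's actual implementation): if the examined
/// bytes (at most the first 50 MB) contain no UTF-8 error, assume UTF-8;
/// otherwise fall back to the locale-dependent legacy encoding.
fn detect_encoding_for_file_url(
    buffered: &[u8],        // up to the first 50 MB of the file
    hit_50_mb_limit: bool,  // true if the file was larger and buffering stopped at the limit
    fallback_legacy_encoding: &'static str, // default depends on the Firefox UI locale
) -> &'static str {
    match std::str::from_utf8(buffered) {
        // No UTF-8 error anywhere in the examined bytes: assume UTF-8.
        Ok(_) => "UTF-8",
        // Assumption of this sketch: a multi-byte sequence cut off exactly at
        // the 50 MB limit is not evidence against UTF-8 (error_len() is None
        // for an incomplete sequence at the end of the input).
        Err(e) if hit_50_mb_limit && e.error_len().is_none() => "UTF-8",
        // A genuine UTF-8 error: behave as at present.
        Err(_) => fallback_legacy_encoding,
    }
}

fn main() {
    // Valid UTF-8 without a BOM or <meta charset>: detected as UTF-8.
    assert_eq!(
        detect_encoding_for_file_url("käyttöohje".as_bytes(), false, "windows-1252"),
        "UTF-8"
    );
    // The same text saved as windows-1252 (0xE4 = ä, 0xF6 = ö) is not valid
    // UTF-8, so the legacy fallback applies.
    assert_eq!(
        detect_encoding_for_file_url(b"k\xE4ytt\xF6ohje", false, "windows-1252"),
        "windows-1252"
    );
}
```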
# Why file: URLs only

The reason why the feature does not apply to http: or https: resources is that in those cases it really isn't OK to assume that all bytes arrive so quickly as to not benefit from incremental rendering, and it isn't OK to assume that the stream doesn't intentionally stall.

Applying detection to http: or https: resources would mean at least one of the following compromises:

* Making the detection unreliable by making it depend on non-ASCII appearing in the first 1024 bytes (the number of bytes currently buffered for scanning <meta>). If the <title> were always near the start of the file and the natural language used a non-Latin script to make non-ASCII in the <title> a certainty, this solution would be reliable. However, this solution would be particularly bad for Latin-script languages with infrequent non-ASCII, such as Finnish or German, which can legitimately have all-ASCII titles despite the language as a whole including non-ASCII. That is, if a developer tested a site with a title that has some non-ASCII, things would appear to work, but then the site would break when an all-ASCII title occurs. (See the sketch at the end of this section.)
* Making results depend on timing. (Having a detection timeout would make the results depend on network performance relative to wall-clock time.)
* Making the detection unreliable by examining only the first buffer passed by the networking subsystem to the HTML parser. This makes the result dependent on network buffer boundaries (*and* potentially on timing to the extent that timing affects the boundaries), which is unreliable. Prior to Firefox 4, HTML parsing in Firefox depended on network buffer boundaries, which was bad and was remedied in Firefox 4. According to https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 , Chrome chooses this mode of badness.
* Breaking incremental rendering. (Not acceptable for remote content for user-perceived performance reasons.) This is what the solution for file: URLs does, on the assumption that it's OK because the data in its entirety is (approximately) immediately available.
* Causing reloads. This is the mode of badness that applies when our Japanese detector is in use and the first 1024 bytes aren't enough to make the decision.

All of these are bad. It's better to make the failure to declare UTF-8 in the http/https case something that the Web developer obviously has to fix (by adding <meta>, the HTTP header or the BOM) than to make it appear that things work when actually at least one of the above forms of badness applies.
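To make the first compromise on the list above concrete, here is a tiny sketch (in Rust, like the earlier sketch; the helper name and the sample markup are invented for illustration) of why keying detection off non-ASCII in the 1024-byte prescan window is unreliable for Latin-script languages:

```rust
/// Illustrative helper (not an actual Gecko function): does the 1024-byte
/// prescan window contain any non-ASCII bytes at all?
fn prefix_signals_non_ascii(doc: &[u8]) -> bool {
    doc.iter().take(1024).any(|&b| b >= 0x80)
}

fn main() {
    // A German page whose <title> happens to be all ASCII: there is no
    // non-ASCII signal in the first 1024 bytes, so prefix-limited detection
    // would fall back to the legacy encoding even though the body contains
    // "Straße" encoded as UTF-8 further down.
    let all_ascii_title = format!(
        "<!DOCTYPE html><title>Impressum</title>{}<p>Straße</p>",
        " ".repeat(1100) // padding standing in for scripts, styles, etc.
    );
    assert!(!prefix_signals_non_ascii(all_ascii_title.as_bytes()));

    // The same kind of page with non-ASCII in the <title>: detection appears
    // to work during testing, masking the failure case above.
    let non_ascii_title = "<!DOCTYPE html><title>Müller GmbH</title><p>Straße</p>";
    assert!(prefix_signals_non_ascii(non_ascii_title.as_bytes()));
}
```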
# Bug

https://bugzilla.mozilla.org/show_bug.cgi?id=1071816

# Link to standard

https://html.spec.whatwg.org/#determining-the-character-encoding step 7 is basically an "anything goes" step for legacy reasons--mainly to allow the Japanese encoding detection that IE, WebKit and Gecko had before the spec was written. Chrome started detecting more without prior standard-setting discussion. See https://github.com/whatwg/encoding/issues/68 for after-the-fact discussion.

# Platform coverage

All

# Estimated or target release

66

# Preference behind which this will be implemented

Not planning to have a pref for this.

# Is this feature enabled by default in sandboxed iframes?

This is implemented to apply to all non-resource:-URL-derived file: URLs, but since same-origin inheritance to child frames takes precedence, this isn't expected to apply to sandboxed iframes in practice.

# DevTools bug

No new dev tools integration. The pre-existing console warning about an undeclared character encoding will still be shown in the autodetection case.

# Do other browser engines implement this

Chrome does, but not with the same number of bytes examined. Safari as of El Capitan (my Mac is stuck on El Capitan) doesn't. Edge as of Windows 10 1803 doesn't.

# web-platform-tests

As far as I'm aware, WPT doesn't cover file: URL behavior, and there isn't a proper spec for this. Hence, unit tests use mochitest-chrome.

# Is this feature restricted to secure contexts?

Restricted to file: URLs.

-- 
Henri Sivonen
hsivo...@mozilla.com

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform