On 8/6/19 8:16 AM, Anne van Kesteren wrote:
> On Sat, Jul 20, 2019 at 2:05 AM Jeff Walden <[email protected]> wrote:
>> (*Only* valid UTF-8: any invalidity, including for WTF-8, is an immediate
>> error, no replacement-character semantics applied.)
>
> Wouldn't adding this allow you to largely bypass the text decoder if
> it identifies the content to be UTF-8? Meaning we'd have to scan and
> copy bytes even less?
In principle "yes"; in current reality "no".
First and foremost, as our script-loading/networking code is set up now, there's
an inevitable copy. When we load a script over an HTTP channel we process its
data using an incremental stream loader (|NS_NewIncrementalStreamLoader|) that
ultimately invokes a function with this signature:
NS_IMETHODIMP
ScriptLoadHandler::OnIncrementalData(nsIIncrementalStreamLoader* aLoader,
                                     nsISupports* aContext,
                                     uint32_t aDataLength,
                                     const uint8_t* aData,
                                     uint32_t* aConsumedLength) {
This function is *provided* a prefilled buffer (filled probably by NSPR,
ultimately by the network driver/card?); we then use an |intl::Decoder| to
decode-by-copying that buffer's bytes into the script loader's buffer, which is
allocated with the JS allocator and reallocated/expanded as necessary. (The
buffer must be JS-allocated because it's transferred to the JS engine for
parsing/compilation/execution. If you don't transfer a JS-owned buffer,
SpiderMonkey makes a fresh copy anyway.) To avoid a copy, you'd need to
intersperse decoding (and buffer-expanding) code into the networking layer --
theoretically doable, practically tricky (especially if we assume the buffer is
mindlessly filled by networking driver code).
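
To make that copy concrete, here's a rough sketch of the per-chunk
decode-and-grow step. This is *not* the real ScriptLoadHandler code: the
function and its parameters are invented for illustration, and I'm paraphrasing
the encoding_rs C++ API (|mozilla::Decoder::MaxUTF16BufferLength| and
|DecodeToUTF16|) from memory, with error handling simplified:

// Illustrative sketch only -- hypothetical function, simplified errors.
#include "js/Utility.h"        // js_pod_realloc
#include "mozilla/CheckedInt.h"
#include "mozilla/Encoding.h"  // mozilla::Decoder
#include "mozilla/Span.h"
#include "mozilla/Tuple.h"
#include "nsError.h"

static nsresult
DecodeChunk(mozilla::Decoder* aDecoder, const uint8_t* aData,
            uint32_t aDataLength, char16_t** aBuf, size_t* aLength,
            size_t* aCapacity) {
  // Worst-case UTF-16 output for this chunk of bytes.
  mozilla::CheckedInt<size_t> newCapacity =
      aDecoder->MaxUTF16BufferLength(aDataLength) + *aLength;
  if (!newCapacity.isValid()) {
    return NS_ERROR_OUT_OF_MEMORY;
  }

  // Grow the JS-allocator-owned buffer so that ownership can later be
  // transferred to SpiderMonkey without yet another copy.
  if (newCapacity.value() > *aCapacity) {
    char16_t* newBuf =
        js_pod_realloc<char16_t>(*aBuf, *aCapacity, newCapacity.value());
    if (!newBuf) {
      return NS_ERROR_OUT_OF_MEMORY;
    }
    *aBuf = newBuf;
    *aCapacity = newCapacity.value();
  }

  // The unavoidable copy: decode the network-filled bytes into the
  // JS-owned buffer.  (A real implementation would also use |result| and
  // |read| to handle code points split across chunk boundaries.)
  uint32_t result;
  size_t read, written;
  bool hadErrors;
  mozilla::Tie(result, read, written, hadErrors) = aDecoder->DecodeToUTF16(
      mozilla::MakeSpan(aData, aDataLength),
      mozilla::MakeSpan(*aBuf + *aLength, *aCapacity - *aLength),
      /* aLast = */ false);
  *aLength += written;
  return NS_OK;
}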
Second -- alternatively -- if the JS engine literally processed raw code units
of utterly unknown validity, so that networking code could directly fill in the
buffer the JS engine would process, the JS engine would require additional
changes to handle invalid code units. UTF-16 already demands this because JS
is natively WTF-16 (and every 16-bit sequence is valid WTF-16). But all UTF-8
processing code assumes validity now, and extra effort would be required to
handle invalidity. We keep a lot of raw pointers/indexes into the source units
and use them later -- for things like atomizing identifiers -- and all those
process-this-range-of-data operations would have to be modified. *Possible*?
Yes. Tricky? For sure. (Particularly as many of those coordinates end up
baked into the final compiled script representation -- so invalidity wouldn't
be a transient concern but one that would persist indefinitely.)
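
To give a flavor of those range-based operations, here's a hypothetical
consumer of such saved coordinates -- the struct and function are invented for
illustration, and I'm recalling the SpiderMonkey-internal |AtomizeUTF8Chars|
from memory -- showing where a validity assumption silently lives:

// Hypothetical illustration, not actual SpiderMonkey code.
#include "mozilla/Utf8.h"  // mozilla::Utf8Unit

struct IdentifierCoords {
  size_t begin;  // offset of the identifier's first code unit
  size_t end;    // offset just past its last code unit
};

// Possibly long after parsing -- coordinates like these get baked into
// the compiled script -- a consumer re-reads the raw source units.
// This silently assumes units[begin, end) is valid UTF-8; with
// unvalidated input, every consumer like this would need its own
// invalidity handling.
static JSAtom*
AtomizeIdentifierAt(JSContext* cx, const mozilla::Utf8Unit* units,
                    const IdentifierCoords& coords) {
  const char* chars = reinterpret_cast<const char*>(units) + coords.begin;
  return AtomizeUTF8Chars(cx, chars, coords.end - coords.begin);
}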
Third, and maybe most important even if the practical considerations didn't
exist: like every sane person, I'm leery of adding yet another place where
arbitrary ostensibly-UTF-8 bytes are decoded with any sort of
invalid-is-not-immediate-error semantics. In the distant past I fixed an
Acid3 failure where we mis-implemented UTF-16 replacement-character semantics:
https://bugzilla.mozilla.org/show_bug.cgi?id=421576
New implementations of replacement-character semantics are disproportionately
risky. In fact, when I fixed that, I failed to fix a separate UTF-16 decoding
implementation that had to stay consistent with it, introducing a security bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=489041#c9
*Some* of that problem was because replacement-character semantics were
under-defined at the time, and now they're well-defined...but still. Risk.
Once burned, twice shy.
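
To illustrate how fiddly these semantics are: the now-well-defined behavior
(the WHATWG Encoding Standard, following Unicode's "maximal subpart"
recommendation) emits one U+FFFD per maximal prefix of a valid sequence, so
superficially similar invalid byte streams decode to different numbers of
replacement characters:

  F0 9F 98 41  =>  U+FFFD U+0041                 (truncated 4-byte sequence:
                                                  one replacement, then 'A')
  F0 28 98 41  =>  U+FFFD U+0028 U+FFFD U+0041   (lone lead byte, '(', stray
                                                  continuation byte, 'A')

Get the grouping wrong -- one U+FFFD per bogus byte, say -- and you disagree
with every other decoder in the browser, which is exactly the kind of
inconsistency that bit us before.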
In principle this is all solvable. But it's all rather complicated and
fraught. We can probably get better and safer wins by making improvements
elsewhere.
Jeff