On 10/03/14 23:54, Keegan McAllister wrote:
Should we implement character encoding detection [1] at the same time
as the rest of the HTML parser?  It seems to be separable; the only
design interactions I see are:

- The character decoder and script APIs can write into the same input
  stream
- The encoding can change during parsing [2], which can be handled as a
  mostly-normal navigation
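For concreteness, here's a heavily simplified sketch of the kind of prescan the encoding-detection step [1] performs: scan the first 1024 bytes for a `charset=` declaration. The function name and the shortcuts (no full attribute tokenization, no label normalization) are mine, not Servo's or the spec's exact algorithm:

```rust
// Hypothetical sketch, not Servo code: look for `charset=...` in the
// first 1024 bytes of the stream. The real spec algorithm tokenizes
// attributes properly and normalizes encoding labels; this only shows
// the shape of the problem.
fn prescan_for_charset(bytes: &[u8]) -> Option<String> {
    let window = &bytes[..bytes.len().min(1024)];
    let lower: Vec<u8> = window.iter().map(|b| b.to_ascii_lowercase()).collect();
    let needle = b"charset=";
    let pos = lower.windows(needle.len()).position(|w| w == needle)?;
    let rest = &lower[pos + needle.len()..];
    // Accept an optionally quoted label, terminated by quote/space/'>'.
    let rest = rest
        .strip_prefix(b"\"")
        .or_else(|| rest.strip_prefix(b"'"))
        .unwrap_or(rest);
    let end = rest
        .iter()
        .position(|&b| b == b'"' || b == b'\'' || b == b' ' || b == b'>')
        .unwrap_or(rest.len());
    Some(String::from_utf8_lossy(&rest[..end]).into_owned())
}
```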

Also, should we follow Gecko in representing the input stream as a
queue of UTF-16 buffers?  With UTF-8 we would have about half as much
data to stream through the parser, and (on 64-bit) we could do
case-insensitive ASCII operations 8 characters at a time.
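The 8-characters-at-a-time idea could look something like the sketch below (illustrative, not Servo code): load 8 bytes as a u64 and OR in the lowercase bit (0x20) of every byte. Note this is only a valid case-fold when one operand is known to consist of lowercase ASCII letters (e.g. matching tag names against known lowercase strings), since OR-ing 0x20 also conflates some punctuation pairs:

```rust
// Illustrative sketch: compare 8 ASCII bytes case-insensitively in one
// 64-bit operation. Safe only when `b` is a known lowercase-letter
// string, as when the tokenizer matches against fixed tag names.
fn eq_ignore_ascii_case_8(a: &[u8; 8], b: &[u8; 8]) -> bool {
    const LOWER_BITS: u64 = 0x2020_2020_2020_2020;
    (u64::from_ne_bytes(*a) | LOWER_BITS) == (u64::from_ne_bytes(*b) | LOWER_BITS)
}
```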

Most content [3] is UTF-8, and the trend is in that direction.  But
for non-UTF-8 content we would pay the price of two conversions, if
the ultimate product is UCS-2 DOM strings.  However I don't think
we've finalized our representation of DOM strings, and we might do
the UCS-2 conversion lazily [4].
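A lazy conversion could be as simple as caching the UTF-16 form on first access. This is a hypothetical sketch (the type name and API are mine, and real DOM strings would want shared ownership rather than a clone per access):

```rust
use std::cell::RefCell;

// Hypothetical sketch: a DOM string stored as UTF-8, with the UTF-16
// form computed on first request and cached thereafter.
struct DomString {
    utf8: String,
    utf16: RefCell<Option<Vec<u16>>>,
}

impl DomString {
    fn new(s: &str) -> DomString {
        DomString { utf8: s.to_string(), utf16: RefCell::new(None) }
    }

    // Converts on the first call only; later calls reuse the cache.
    fn as_utf16(&self) -> Vec<u16> {
        self.utf16
            .borrow_mut()
            .get_or_insert_with(|| self.utf8.encode_utf16().collect())
            .clone()
    }
}
```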

Apart from the compat problems that Simon already mentioned (which I think are important), it's worth considering in what cases the parser is likely to be a bottleneck. I believe that in load-type situations the parser is almost never the performance-limiting factor (assuming a reasonably well-optimised implementation); it's much more likely that network performance, layout, or scripting will dominate the time to load.

On the other hand, I think there exist pages which do things like run innerHTML in performance-critical loops. Therefore I expect it makes more sense for the parser to operate on the same string type as the DOM than on the same type as the network data.
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
