On 10/03/14 23:54, Keegan McAllister wrote:
Should we implement character encoding detection [1] at the same time
as the rest of the HTML parser?  It seems to be separable; the only
design interactions I see are:

- The character decoder and script APIs can write into the same input
  stream
- The encoding can change during parsing [2], which can be handled as a
  mostly-normal navigation
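For concreteness, here's a heavily simplified sketch of the kind of prescan the encoding-detection step [1] performs: scan the first 1024 bytes for a `charset=` declaration. The function name and the shortcuts (no full attribute tokenization, no label normalization) are mine, not Servo's or the spec's exact algorithm:

```rust
// Hypothetical sketch, not Servo code: look for `charset=...` in the
// first 1024 bytes of the stream. The real spec algorithm tokenizes
// attributes properly and normalizes encoding labels; this only shows
// the shape of the problem.
fn prescan_for_charset(bytes: &[u8]) -> Option<String> {
    let window = &bytes[..bytes.len().min(1024)];
    let lower: Vec<u8> = window.iter().map(|b| b.to_ascii_lowercase()).collect();
    let needle = b"charset=";
    let pos = lower.windows(needle.len()).position(|w| w == needle)?;
    let rest = &lower[pos + needle.len()..];
    // Accept an optionally quoted label, terminated by quote/space/'>'.
    let rest = rest
        .strip_prefix(b"\"")
        .or_else(|| rest.strip_prefix(b"'"))
        .unwrap_or(rest);
    let end = rest
        .iter()
        .position(|&b| b == b'"' || b == b'\'' || b == b' ' || b == b'>')
        .unwrap_or(rest.len());
    Some(String::from_utf8_lossy(&rest[..end]).into_owned())
}
```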

Also, should we follow Gecko in representing the input stream as a
queue of UTF-16 buffers?  With UTF-8 we would have about half as much
data to stream through the parser, and (on 64-bit) we could do
case-insensitive ASCII operations 8 characters at a time.
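The 8-characters-at-a-time idea could look something like the sketch below (illustrative, not Servo code): load 8 bytes as a u64 and OR in the lowercase bit (0x20) of every byte. Note this is only a valid case-fold when one operand is known to consist of lowercase ASCII letters (e.g. matching tag names against known lowercase strings), since OR-ing 0x20 also conflates some punctuation pairs:

```rust
// Illustrative sketch: compare 8 ASCII bytes case-insensitively in one
// 64-bit operation. Safe only when `b` is a known lowercase-letter
// string, as when the tokenizer matches against fixed tag names.
fn eq_ignore_ascii_case_8(a: &[u8; 8], b: &[u8; 8]) -> bool {
    const LOWER_BITS: u64 = 0x2020_2020_2020_2020;
    (u64::from_ne_bytes(*a) | LOWER_BITS) == (u64::from_ne_bytes(*b) | LOWER_BITS)
}
```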

Most content [3] is UTF-8, and the trend is in that direction.  But
for non-UTF-8 content we would pay the price of two conversions, if
the ultimate product is UCS-2 DOM strings.  However I don't think
we've finalized our representation of DOM strings, and we might do
the UCS-2 conversion lazily [4].
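A lazy conversion could be as simple as caching the UTF-16 form on first access. This is a hypothetical sketch (the type name and API are mine, and real DOM strings would want shared ownership rather than a clone per access):

```rust
use std::cell::RefCell;

// Hypothetical sketch: a DOM string stored as UTF-8, with the UTF-16
// form computed on first request and cached thereafter.
struct DomString {
    utf8: String,
    utf16: RefCell<Option<Vec<u16>>>,
}

impl DomString {
    fn new(s: &str) -> DomString {
        DomString { utf8: s.to_string(), utf16: RefCell::new(None) }
    }

    // Converts on the first call only; later calls reuse the cache.
    fn as_utf16(&self) -> Vec<u16> {
        self.utf16
            .borrow_mut()
            .get_or_insert_with(|| self.utf8.encode_utf16().collect())
            .clone()
    }
}
```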

Apart from the compat problems that Simon already mentioned (which I think are important), it's worth considering in what cases the parser is likely to be a bottleneck. I believe that in load-type situations the parser is almost never the performance-limiting factor (assuming a reasonably well-optimised implementation); it's much more likely that network performance, layout, or scripting will dominate the time to load.

On the other hand, I think there exist pages which do things like run innerHTML in performance-critical loops. Therefore I expect it makes more sense for the parser to operate on the same string type as the DOM than on the same type as the network data.
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
