Re: [dev-servo] HTML parser-related datatypes

2012-10-11 Thread Zack Weinberg

On 2012-10-10 9:03 PM, Johnny Stenback wrote:

Hey Henri,

On 10/10/2012 5:51 AM, Henri Sivonen wrote:

I am researching/prototyping a translation of the same HTML parser we
use in Gecko into Rust for use in Servo. Should the HTML parser in
Servo operate on UTF-8, UTF-16 or CESU-8? What will the DOM in Servo
use internally?


Excellent question. This is something that has been discussed before
(there was a discussion at the previous DOM/WebAPI work week about
this), and it's my personal belief that having the DOM use anything
other than what JS requires, i.e. UTF-16 (or UCS-2, really), is a
non-starter from a performance point of view. As you point out below,
having to copy string data on every string access from JS would kill us
on performance. Ironically we often do end up doing that conversion in
the current DOM implementation, but AFAIK that's fairly isolated to
accessing the text value of DOM text nodes. I'd argue that that's a
relatively rare operation compared to accessing property/attribute
values during tree iteration etc. That I think is where it matters.


I'm not seeing why the JS engine has to use any particular 
representation internally just because JS's exposed semantics are 
defined in terms of UCS-2.  In fact, storing strings internally in UTF-8 
and mapping them to UCS-2 only when it actually makes a difference to 
the program might well be *faster* than storing everything internally in 
UTF-16.  And it would pave the way to providing alternative JS 
functionality that doesn't expose the nastier parts of UCS-2 (e.g. 
surrogate pairs).
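
A minimal sketch of that idea, under stated assumptions: `JsStr` and its
methods are hypothetical names, not anything in Servo or an existing JS
engine. The string is stored as UTF-8 and only computes UTF-16 code units
when script asks for them. An "is ASCII" flag is cached at construction so
the common case needs no conversion at all.

```rust
/// Hypothetical string type: UTF-8 storage, UTF-16 code units on demand.
pub struct JsStr {
    utf8: String,
    /// Cached at construction; for ASCII, byte index == UTF-16 index.
    ascii: bool,
}

impl JsStr {
    pub fn new(s: &str) -> Self {
        JsStr { utf8: s.to_owned(), ascii: s.is_ascii() }
    }

    /// What script would see as `length`: the number of UTF-16 code units.
    pub fn length(&self) -> usize {
        if self.ascii {
            self.utf8.len()
        } else {
            // Linear scan; a real engine would probably cache this too.
            self.utf8.encode_utf16().count()
        }
    }

    /// What script would see as `charCodeAt(index)`.
    pub fn char_code_at(&self, index: usize) -> Option<u16> {
        if self.ascii {
            // Constant time: one byte per code unit.
            self.utf8.as_bytes().get(index).map(|&b| u16::from(b))
        } else {
            // Linear time: re-encode lazily and walk to the index.
            self.utf8.encode_utf16().nth(index)
        }
    }
}
```

The non-ASCII path is where this costs something: indexed access turns
into a scan unless the engine keeps extra index data alongside the string.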


(This presumes that part of Servo involves reimplementing the JS engine 
in Rust anyway.  Without that, I can imagine this being far too much 
work to bother with :)


zw
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo


Re: [dev-servo] HTML parser-related datatypes

2012-10-11 Thread Boris Zbarsky

On 10/11/12 4:09 PM, Zack Weinberg wrote:

I'm not seeing why the JS engine has to use any particular
representation internally just because JS's exposed semantics are
defined in terms of UCS-2.


Well, because it's simpler and because it makes charAt() fast?


(This presumes that part of Servo involves reimplementing the JS engine
in Rust anyway.


At the moment it does not.  So far.

-Boris


Re: [dev-servo] HTML parser-related datatypes

2012-10-11 Thread David Herman
On Oct 11, 2012, at 2:01 PM, Boris Zbarsky wrote:

> On 10/11/12 4:09 PM, Zack Weinberg wrote:
>> I'm not seeing why the JS engine has to use any particular
>> representation internally just because JS's exposed semantics are
>> defined in terms of UCS-2.
> 
> Well, because it's simpler and because it makes charAt() fast?

Simpler maybe, but since strings are immutable it's perfectly reasonable to 
have multiple internal string types and tag strings in the heap as being e.g. 
ASCII-only, so that they can be stored more compactly and still have fast 
accesses. It does mean more proliferation of string types, and string 
operations have to have multiple code paths, but I imagine for many 
applications it could have significant wins.
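
Roughly what that could look like, as a sketch only: `HeapStr` and its
variant names are made up for illustration and are not taken from any
engine. Because the string is immutable, the representation is chosen once
at creation, and every operation dispatches on the tag.

```rust
/// Hypothetical immutable heap string, tagged by representation.
pub enum HeapStr {
    /// Every code unit is below 0x80, so one byte per code unit suffices.
    Ascii(Box<[u8]>),
    /// General case: full UTF-16 code units.
    TwoByte(Box<[u16]>),
}

impl HeapStr {
    /// Pick the most compact representation that can hold the input.
    pub fn new(s: &str) -> Self {
        if s.is_ascii() {
            HeapStr::Ascii(s.as_bytes().into())
        } else {
            HeapStr::TwoByte(s.encode_utf16().collect())
        }
    }

    /// Number of UTF-16 code units, as script would observe it.
    pub fn length(&self) -> usize {
        match self {
            HeapStr::Ascii(bytes) => bytes.len(),
            HeapStr::TwoByte(units) => units.len(),
        }
    }

    /// Constant-time indexed access in both representations.
    pub fn char_code_at(&self, index: usize) -> Option<u16> {
        match self {
            HeapStr::Ascii(bytes) => bytes.get(index).map(|&b| u16::from(b)),
            HeapStr::TwoByte(units) => units.get(index).copied(),
        }
    }

    /// Bytes used by the character data (excluding the tag and pointer).
    pub fn byte_size(&self) -> usize {
        match self {
            HeapStr::Ascii(bytes) => bytes.len(),
            HeapStr::TwoByte(units) => units.len() * 2,
        }
    }
}
```

The `match` in each method is the "multiple code paths" cost; the payoff
is that ASCII-only strings take half the space and indexing stays O(1).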

I can't remember whether other engines do this.

Dave
