Re: [dev-servo] character encoding in the HTML parser

2014-04-22 Thread Keegan McAllister
/html5/blob/sse/src/tokenizer/buffer_queue.rs#L182-L252 - Original Message - From: "Keegan McAllister" To: "Luke Wagner" Cc: "Henri Sivonen", "Boris Zbarsky", mozilla-dev-se...@lists.mozilla.org, "Robert O'Callahan", "Eddy B

Re: [dev-servo] character encoding in the HTML parser

2014-04-22 Thread Keegan McAllister
l Message - From: "Luke Wagner" To: "Henri Sivonen" Cc: "Boris Zbarsky", mozilla-dev-se...@lists.mozilla.org, "Robert O'Callahan", "Eddy Bruel" Sent: Thursday, April 3, 2014 9:02:38 AM Subject: Re: [dev-servo] character encoding in the HT

Re: [dev-servo] character encoding in the HTML parser

2014-04-03 Thread Luke Wagner
Another option we've just been discussing is to lazily compute a flag on the string indicating "contents are 7-bit ASCII" that would allow us to use array indexing. I'd expect this to often be true. There are also many cases where we'd eagerly have this flag (atoms produced during parsing, strings
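
To illustrate the lazily computed flag idea, here is a minimal sketch. `FlaggedString` and `code_unit_at` are hypothetical names made up for this example; they are not SpiderMonkey or Servo types. It assumes the string is stored as UTF-8 and that script-visible indexing counts UTF-16 code units, so plain array indexing is only valid when every byte is 7-bit ASCII.

```rust
use std::cell::Cell;

/// Hypothetical wrapper around UTF-8 text (an assumption for illustration,
/// not an actual engine type).
pub struct FlaggedString {
    text: String,
    /// `None` until first queried; afterwards caches "contents are 7-bit ASCII".
    ascii: Cell<Option<bool>>,
}

impl FlaggedString {
    pub fn new(text: &str) -> Self {
        FlaggedString { text: text.to_string(), ascii: Cell::new(None) }
    }

    /// Compute the flag on first use and reuse the cached value afterwards.
    pub fn is_ascii(&self) -> bool {
        if let Some(flag) = self.ascii.get() {
            return flag;
        }
        let flag = self.text.is_ascii();
        self.ascii.set(Some(flag));
        flag
    }

    /// Return the UTF-16 code unit at `index`, as a charAt-style API would see it.
    pub fn code_unit_at(&self, index: usize) -> Option<u16> {
        if self.is_ascii() {
            // Fast path: one byte per code unit, so direct array indexing works.
            self.text.as_bytes().get(index).map(|&b| u16::from(b))
        } else {
            // Slow path: walk the string, transcoding to UTF-16 on the fly.
            self.text.encode_utf16().nth(index)
        }
    }
}
```

The slow path is linear in the index, which is the cost the flag is meant to avoid in the common ASCII case.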

Re: [dev-servo] character encoding in the HTML parser

2014-04-03 Thread Boris Zbarsky
On 4/3/14 8:03 AM, Henri Sivonen wrote: > Have we instrumented Gecko to find out what the access patterns are like? We have not, but I will bet money the answer is "different for benchmarks and actual content"... -Boris ___ dev-servo mailing list dev

Re: [dev-servo] character encoding in the HTML parser

2014-04-03 Thread Henri Sivonen
On Wed, Apr 2, 2014 at 4:25 PM, Robert O'Callahan wrote: > If we could get the JS engine to use evil-UTF8 with some hack to handle charAt and friends efficiently (e.g. tacking on a UCS-2 version of the string when necessary) Have we instrumented Gecko to find out what the access patterns are

Re: [dev-servo] character encoding in the HTML parser

2014-04-03 Thread Henri Sivonen
On Tue, Apr 1, 2014 at 12:50 PM, Simon Sapin wrote: > On 01/04/2014 03:01, Keegan McAllister wrote: >> It does seem like replacing truly lone surrogates with U+FFFD would be an acceptable deviation from the spec, but maybe we want to avoid those absolutely. > As much as I’d like this to

Re: [dev-servo] character encoding in the HTML parser

2014-04-02 Thread Robert O'Callahan
On Wed, Apr 2, 2014 at 10:45 AM, Luke Wagner wrote: > Nick just brought up the topic of adding a second compact string representation to the JS engine (motivated by string memory use). One question was whether to use ASCII (which V8 does, iirc) or UTF8. Several DOM people have pointed out

Re: [dev-servo] character encoding in the HTML parser

2014-04-02 Thread Luke Wagner
Nick just brought up the topic of adding a second compact string representation to the JS engine (motivated by string memory use). One question was whether to use ASCII (which V8 does, iirc) or UTF8. Several DOM people have pointed out over the years that if SM would accept UTF8 it'd be really

Re: [dev-servo] character encoding in the HTML parser

2014-04-02 Thread Robert O'Callahan
On Tue, Apr 1, 2014 at 3:18 PM, Boris Zbarsky wrote: > On 4/1/14 3:07 PM, Keegan McAllister wrote: >> Who should I talk to about JS string representation changes? > Eddy Bruel. ejpbruel at mozilla dot com. If we could get the JS engine to use evil-UTF8 with some hack to handle charAt an

Re: [dev-servo] character encoding in the HTML parser

2014-04-01 Thread Boris Zbarsky
On 4/1/14 3:07 PM, Keegan McAllister wrote: > Who should I talk to about JS string representation changes? Eddy Bruel. ejpbruel at mozilla dot com. -Boris ___ dev-servo mailing list dev-servo@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-se

Re: [dev-servo] character encoding in the HTML parser

2014-04-01 Thread Keegan McAllister
> The JS folks are very interested in trying to move away from pure-UCS-2, for memory reasons... That's very interesting. Servo seems like a good place to prototype that with a DOM to match. Who should I talk to about JS string representation changes? keegan

Re: [dev-servo] character encoding in the HTML parser

2014-04-01 Thread Simon Sapin
On 01/04/2014 03:01, Keegan McAllister wrote: > It does seem like replacing truly lone surrogates with U+FFFD would be an acceptable deviation from the spec, but maybe we want to avoid those absolutely. As much as I’d like this to be true, I don’t know. Henri seemed pretty opposed to changing th

Re: [dev-servo] character encoding in the HTML parser

2014-03-31 Thread Boris Zbarsky
On 3/31/14 10:01 PM, Keegan McAllister wrote: Good point. Even if the DOM is a mix of UTF-8 lazily converted to UCS-2, the argument to document.write or the innerHTML setter is a JS string. Which may itself in the future be some mix of UCS-2 and ascii, or UCS-2 and "evil UTF-8", or just UTF-

Re: [dev-servo] character encoding in the HTML parser

2014-03-31 Thread Keegan McAllister
> Unfortunately I’d be less surprised if someone relies on having the two halves of a surrogate pair in separate document.write() call, as this seems more interoperable: > data:text/html,document.write("\uD83D");document.write("\uDCA9") The tokenizer's input is a queue of buffers, and I'm
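
The split-surrogate case can be sketched as follows. This is only an illustration under stated assumptions: `InputQueue`, `push_utf16`, and `finish` are hypothetical names and are not taken from the tokenizer's actual buffer_queue.rs. The sketch assumes the queue stores UTF-8 buffers, holds back a trailing lead surrogate until the next write arrives, and replaces surrogates that remain unpaired with U+FFFD (the spec deviation discussed elsewhere in this thread).

```rust
use std::collections::VecDeque;

/// Hypothetical input queue of UTF-8 buffers fed by script writes.
pub struct InputQueue {
    buffers: VecDeque<String>,
    /// A lead surrogate seen at the end of the previous write, if any.
    pending_lead: Option<u16>,
}

impl InputQueue {
    pub fn new() -> Self {
        InputQueue { buffers: VecDeque::new(), pending_lead: None }
    }

    /// Append the UTF-16 argument of one document.write()-style call.
    pub fn push_utf16(&mut self, units: &[u16]) {
        let mut all: Vec<u16> = Vec::with_capacity(units.len() + 1);
        // Re-attach a lead surrogate left over from the previous call.
        if let Some(lead) = self.pending_lead.take() {
            all.push(lead);
        }
        all.extend_from_slice(units);
        // Hold back a trailing lead surrogate; its other half may come next.
        if let Some(&last) = all.last() {
            if (0xD800..=0xDBFF).contains(&last) {
                self.pending_lead = Some(last);
                all.pop();
            }
        }
        if !all.is_empty() {
            // Surrogates that are still unpaired become U+FFFD here.
            self.buffers.push_back(String::from_utf16_lossy(&all));
        }
    }

    /// At end of input, a held-back lead surrogate can no longer be paired.
    pub fn finish(&mut self) {
        if self.pending_lead.take().is_some() {
            self.buffers.push_back("\u{FFFD}".to_string());
        }
    }

    pub fn pop_front(&mut self) -> Option<String> {
        self.buffers.pop_front()
    }
}
```

With this sketch, writing 0xD83D and then 0xDCA9 in two separate calls yields a single buffer containing U+1F4A9, matching the data: URL example quoted above.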

Re: [dev-servo] character encoding in the HTML parser

2014-03-31 Thread James Graham
On 10/03/14 23:54, Keegan McAllister wrote: Should we implement character encoding detection [1] at the same time as the rest of the HTML parser? It seems to be separable; the only design interactions I see are: - The character decoder and script APIs can write into the same input stream - The

Re: [dev-servo] character encoding in the HTML parser

2014-03-30 Thread Simon Sapin
On 29/03/2014 22:56, Simon Sapin wrote: On 10/03/2014 23:54, Keegan McAllister wrote: Speaking of which, [5] Any character that is a not a Unicode character, i.e. any isolated surrogate, is a parse error. (These can only find their way into the input stream via script APIs such as document.wri

Re: [dev-servo] character encoding in the HTML parser

2014-03-30 Thread Simon Sapin
On 29/03/2014 23:15, Boris Zbarsky wrote: On 3/29/14 6:56 PM, Simon Sapin wrote: Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8 without the artificial restriction of not encoding surrogates. http://en.wikipedia.org/wiki/CESU-8 CESU-8 is evil too, but it’s not what I had i
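
One possible reading of "evil UTF-8", as distinct from CESU-8, is sketched below. The function names are hypothetical, and the pairing rule is an assumption: a lead surrogate followed by a trail surrogate is combined into one supplementary code point and encoded in four bytes as in standard UTF-8, while only an unpaired surrogate is encoded on its own as a three-byte sequence. CESU-8, by contrast, encodes both halves of every pair separately.

```rust
/// Encode possibly ill-formed UTF-16 as UTF-8 that is also allowed to
/// contain lone surrogates (hypothetical helper for illustration).
pub fn encode_lenient_utf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len());
    let mut i = 0;
    while i < units.len() {
        let unit = u32::from(units[i]);
        let mut code_point = unit;
        i += 1;
        // Combine a lead surrogate with an immediately following trail surrogate.
        if (0xD800..=0xDBFF).contains(&unit) && i < units.len() {
            let next = u32::from(units[i]);
            if (0xDC00..=0xDFFF).contains(&next) {
                code_point = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                i += 1;
            }
        }
        push_code_point(&mut out, code_point);
    }
    out
}

/// Standard UTF-8 bit layout, minus the check that rejects U+D800..=U+DFFF.
fn push_code_point(out: &mut Vec<u8>, cp: u32) {
    if cp < 0x80 {
        out.push(cp as u8);
    } else if cp < 0x800 {
        out.push(0xC0 | (cp >> 6) as u8);
        out.push(0x80 | (cp & 0x3F) as u8);
    } else if cp < 0x10000 {
        out.push(0xE0 | (cp >> 12) as u8);
        out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
        out.push(0x80 | (cp & 0x3F) as u8);
    } else {
        out.push(0xF0 | (cp >> 18) as u8);
        out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
        out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
        out.push(0x80 | (cp & 0x3F) as u8);
    }
}
```

For well-formed input the output is ordinary UTF-8; a lone surrogate produces bytes that a strict UTF-8 decoder rejects, which is why such data could not be exposed as a normal Rust `str`.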

Re: [dev-servo] character encoding in the HTML parser

2014-03-29 Thread Boris Zbarsky
On 3/29/14 6:56 PM, Simon Sapin wrote: Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8 without the artificial restriction of not encoding surrogates. http://en.wikipedia.org/wiki/CESU-8 As far as I understand, a "parse error" in the spec is meant for conformance checkers (

Re: [dev-servo] character encoding in the HTML parser

2014-03-29 Thread Simon Sapin
On 10/03/2014 23:54, Keegan McAllister wrote: [...] Also, should we follow Gecko in representing the input stream as a queue of UTF-16 buffers? With UTF-8 we would have about half as much data to stream through the parser, and (on 64-bit) we could do case-insensitive ASCII operations 8 characte
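
The "8 characters at a time" point can be illustrated with a word-at-a-time ASCII lowercasing sketch. The function names are hypothetical and this is not code from the parser; it assumes UTF-8 input and relies on the fact that non-ASCII bytes always have the high bit set, so they can be masked out and left untouched.

```rust
/// Lowercase the ASCII letters in eight packed bytes at once.
fn lowercase_word(word: u64) -> u64 {
    const ONES: u64 = 0x0101_0101_0101_0101;
    // Low seven bits of every byte, so the per-byte additions cannot carry over.
    let heptets = word & (0x7F * ONES);
    // High bit of each byte is set where the byte's low seven bits are > 'Z'.
    let gt_z = heptets + (0x7F - u64::from(b'Z')) * ONES;
    // High bit of each byte is set where the byte's low seven bits are >= 'A'.
    let ge_a = heptets + (0x80 - u64::from(b'A')) * ONES;
    // High bit of each byte is set where the original byte is ASCII.
    let ascii = !word & (0x80 * ONES);
    // 0x80 in every byte that is an ASCII uppercase letter, zero elsewhere.
    let upper = ascii & (ge_a ^ gt_z);
    // Shifting 0x80 right by two gives 0x20, the ASCII case bit.
    word | (upper >> 2)
}

/// Lowercase ASCII letters in place, leaving all other bytes unchanged.
pub fn ascii_lowercase_in_place(bytes: &mut [u8]) {
    let mut chunks = bytes.chunks_exact_mut(8);
    for chunk in &mut chunks {
        let mut packed = [0u8; 8];
        packed.copy_from_slice(chunk);
        // Byte order is irrelevant because each byte is handled independently.
        let word = u64::from_ne_bytes(packed);
        chunk.copy_from_slice(&lowercase_word(word).to_ne_bytes());
    }
    // Fewer than eight bytes remain; handle them one at a time.
    chunks.into_remainder().make_ascii_lowercase();
}
```

The same trick on UTF-16 data would cover only four code units per 64-bit word, which is one way to read the comparison in the message above.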

[dev-servo] character encoding in the HTML parser

2014-03-10 Thread Keegan McAllister
Should we implement character encoding detection [1] at the same time as the rest of the HTML parser? It seems to be separable; the only design interactions I see are: - The character decoder and script APIs can write into the same input stream - The encoding can change during parsing [2], whic