- Original Message -
From: "Keegan McAllister"
To: "Luke Wagner"
Cc: "Henri Sivonen", "Boris Zbarsky", mozilla-dev-se...@lists.mozilla.org,
"Robert O'Callahan", "Eddy Bruel"

- Original Message -
From: "Luke Wagner"
To: "Henri Sivonen"
Cc: "Boris Zbarsky", mozilla-dev-se...@lists.mozilla.org,
"Robert O'Callahan", "Eddy Bruel"
Sent: Thursday, April 3, 2014 9:02:38 AM
Subject: Re: [dev-servo] character encoding in the HTML parser
Another option we've just been discussing is to lazily compute a flag on the
string indicating "contents are 7-bit ascii" that allowed us to use array
indexing. I'd expect this to often be true. There are also many cases where
we'd eagerly have this flag (atoms produced during parsing, strings
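For illustration, a minimal sketch of such a lazily computed flag (hypothetical Rust types, not SpiderMonkey's actual string layout):

```rust
// Hypothetical sketch: a string that lazily caches an
// "all contents are 7-bit ASCII" flag. When the flag is true,
// byte indexing doubles as charAt, since each code unit is one byte.
use std::cell::Cell;

pub struct FlaggedStr {
    bytes: Vec<u8>,
    ascii: Cell<Option<bool>>, // None = flag not yet computed
}

impl FlaggedStr {
    pub fn new(bytes: Vec<u8>) -> FlaggedStr {
        FlaggedStr { bytes, ascii: Cell::new(None) }
    }

    /// Compute the flag on first use, then reuse the cached answer.
    pub fn is_ascii(&self) -> bool {
        match self.ascii.get() {
            Some(flag) => flag,
            None => {
                let flag = self.bytes.iter().all(|&b| b < 0x80);
                self.ascii.set(Some(flag));
                flag
            }
        }
    }

    /// O(1) charAt when the string is known to be ASCII.
    pub fn char_at(&self, i: usize) -> Option<char> {
        if self.is_ascii() {
            self.bytes.get(i).map(|&b| b as char)
        } else {
            None // a real engine would fall back to a slower scan
        }
    }
}
```

The flag could also be set eagerly in the cases mentioned above (atoms produced during parsing), skipping the lazy scan entirely.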
On 4/3/14 8:03 AM, Henri Sivonen wrote:
> Have we instrumented Gecko to find out what the access patterns are
> like?

We have not, but I will bet money the answer is "different for
benchmarks and actual content"...

-Boris
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
On Wed, Apr 2, 2014 at 4:25 PM, Robert O'Callahan wrote:
> If we could get the JS engine to use evil-UTF8 with some hack to handle
> charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
> string when necessary)

Have we instrumented Gecko to find out what the access patterns are
like?
On Tue, Apr 1, 2014 at 12:50 PM, Simon Sapin wrote:
> On 01/04/2014 03:01, Keegan McAllister wrote:
>>
>> It does seem like replacing truly lone surrogates with U+FFFD would
>> be an acceptable deviation from the spec, but maybe we want to avoid
>> those absolutely.
>
> As much as I’d like this to
On Wed, Apr 2, 2014 at 10:45 AM, Luke Wagner wrote:
> Nick just brought up the topic of adding a second compact string
> representation to the JS engine (motivated by string memory use). One
> question was whether to use ASCII (which V8 does, iirc) or UTF8. Several
> DOM people have pointed out
Nick just brought up the topic of adding a second compact string representation
to the JS engine (motivated by string memory use). One question was whether to
use ASCII (which V8 does, iirc) or UTF8. Several DOM people have pointed out
over the years that if SM would accept UTF8 it'd be really
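A toy sketch of what a second, compact representation could look like (hypothetical names; a real engine would pack the discriminant into the existing string header rather than use a plain enum):

```rust
// Hypothetical two-representation string: pure-ASCII contents are
// stored one byte per code unit, everything else as UCS-2, halving
// character-data memory for the common ASCII case.
pub enum CompactString {
    Ascii(Vec<u8>), // one byte per code unit
    Ucs2(Vec<u16>), // two bytes per code unit
}

impl CompactString {
    /// Pick the narrow form whenever every code unit fits in 7 bits.
    pub fn from_units(units: &[u16]) -> CompactString {
        if units.iter().all(|&u| u < 0x80) {
            CompactString::Ascii(units.iter().map(|&u| u as u8).collect())
        } else {
            CompactString::Ucs2(units.to_vec())
        }
    }

    /// Bytes of character data (ignoring header overhead).
    pub fn data_bytes(&self) -> usize {
        match self {
            CompactString::Ascii(v) => v.len(),
            CompactString::Ucs2(v) => v.len() * 2,
        }
    }
}
```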
On Tue, Apr 1, 2014 at 3:18 PM, Boris Zbarsky wrote:
> On 4/1/14 3:07 PM, Keegan McAllister wrote:
>> Who should I talk to about JS string representation changes?
>
> Eddy Bruel. ejpbruel at mozilla dot com.

If we could get the JS engine to use evil-UTF8 with some hack to handle
charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
string when necessary)
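For concreteness, "evil UTF-8" is just the ordinary UTF-8 bit patterns applied without the rule that bans surrogate code points, essentially the scheme Simon later wrote up as WTF-8. A sketch of the encoder:

```rust
// Ordinary UTF-8 encoding rules, minus the restriction that forbids
// surrogate code points (U+D800..U+DFFF). A lone surrogate simply
// gets the normal three-byte pattern.
pub fn encode_evil(cp: u32) -> Vec<u8> {
    match cp {
        0..=0x7F => vec![cp as u8],
        0x80..=0x7FF => vec![
            0xC0 | (cp >> 6) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        0x800..=0xFFFF => vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        0x1_0000..=0x10_FFFF => vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        _ => panic!("not a code point"),
    }
}
```

So the lone surrogate U+D83D encodes as ED A0 BD, bytes that a strict UTF-8 decoder would reject.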
On 4/1/14 3:07 PM, Keegan McAllister wrote:
> Who should I talk to about JS string representation changes?

Eddy Bruel. ejpbruel at mozilla dot com.

-Boris
> The JS folks are very
> interested in trying to move away from pure-UCS-2, for memory reasons...
That's very interesting. Servo seems like a good place to prototype that with
a DOM to match.
Who should I talk to about JS string representation changes?
keegan
On 01/04/2014 03:01, Keegan McAllister wrote:
> It does seem like replacing truly lone surrogates with U+FFFD would
> be an acceptable deviation from the spec, but maybe we want to avoid
> those absolutely.

As much as I’d like this to be true, I don’t know. Henri seemed pretty
opposed to changing th
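For what it's worth, Rust's standard library already takes exactly this deviation when converting: `String::from_utf16_lossy` substitutes U+FFFD for every unpaired surrogate.

```rust
fn main() {
    // 0xD83D 0xDCA9 is a valid surrogate pair (U+1F4A9, PILE OF POO);
    // the trailing 0xD83D on its own is a lone surrogate.
    let units = [0xD83D, 0xDCA9, 0xD83D];
    let s = String::from_utf16_lossy(&units);
    // The pair decodes normally; the lone surrogate becomes U+FFFD.
    assert_eq!(s, "\u{1F4A9}\u{FFFD}");
}
```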
On 3/31/14 10:01 PM, Keegan McAllister wrote:
Good point. Even if the DOM is a mix of UTF-8 lazily converted to UCS-2, the
argument to document.write or the innerHTML setter is a JS string.
Which may itself in the future be some mix of UCS-2 and ascii, or UCS-2
and "evil UTF-8", or just UTF-
> Unfortunately I’d be less surprised if someone relies on having the two
> halves of a surrogate pair in separate document.write() calls, as this
> seems more interoperable:
>
> data:text/html,document.write("\uD83D");document.write("\uDCA9")

The tokenizer's input is a queue of buffers, and I'm
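A hypothetical sketch (not the actual html5ever code) of how such a queue of UTF-16 buffers could handle that case: a trailing lead surrogate is held back until the next `document.write()` buffer arrives, so a pair split across writes is never tokenized as two lone surrogates.

```rust
// Sketch of a tokenizer input queue that refuses to split a surrogate
// pair across document.write() boundaries.
use std::collections::VecDeque;

pub struct BufferQueue {
    buffers: VecDeque<Vec<u16>>,
    pending_lead: Option<u16>, // lead surrogate awaiting its trail half
}

impl BufferQueue {
    pub fn new() -> BufferQueue {
        BufferQueue { buffers: VecDeque::new(), pending_lead: None }
    }

    pub fn push(&mut self, mut units: Vec<u16>) {
        if let Some(lead) = self.pending_lead.take() {
            units.insert(0, lead); // rejoin the held-back lead surrogate
        }
        if let Some(&last) = units.last() {
            if (0xD800..0xDC00).contains(&last) {
                self.pending_lead = units.pop(); // hold it until the next write
            }
        }
        if !units.is_empty() {
            self.buffers.push_back(units);
        }
    }

    /// Everything currently available to the tokenizer.
    pub fn drain(&mut self) -> Vec<u16> {
        self.buffers.drain(..).flatten().collect()
    }
}
```

With the data: URL above, the first `push` makes nothing available; the second delivers the complete pair.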
On 10/03/14 23:54, Keegan McAllister wrote:
> Should we implement character encoding detection [1] at the same time
> as the rest of the HTML parser? It seems to be separable; the only
> design interactions I see are:
>
> - The character decoder and script APIs can write into the same input
>   stream
> - The
On 29/03/2014 22:56, Simon Sapin wrote:
> On 10/03/2014 23:54, Keegan McAllister wrote:
>
> Speaking of which, [5]
>
>> Any character that is not a Unicode character, i.e. any isolated
>> surrogate, is a parse error. (These can only find their way into
>> the input stream via script APIs such as document.wri
On 29/03/2014 23:15, Boris Zbarsky wrote:
> On 3/29/14 6:56 PM, Simon Sapin wrote:
>> Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
>> without the artificial restriction of not encoding surrogates.
>
> http://en.wikipedia.org/wiki/CESU-8

CESU-8 is evil too, but it’s not what I had in mind
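The byte-level difference, using the U+1F4A9 example from earlier in the thread (CESU-8 bytes written out by hand here, as an illustration of the distinction):

```rust
fn main() {
    // Plain UTF-8 (and "evil UTF-8") encode U+1F4A9 as four bytes.
    let utf8 = "\u{1F4A9}".as_bytes().to_vec();
    assert_eq!(utf8, vec![0xF0, 0x9F, 0x92, 0xA9]);

    // CESU-8 instead applies the three-byte UTF-8 pattern to each half
    // of the UTF-16 surrogate pair (0xD83D, 0xDCA9): six bytes total.
    let cesu8 = vec![0xED, 0xA0, 0xBD, 0xED, 0xB2, 0xA9];
    assert_eq!(cesu8.len(), 6);

    // The schemes agree on the BMP but diverge on astral characters:
    // "evil UTF-8" only adds encodings for *lone* surrogates.
    assert_ne!(cesu8, utf8);
}
```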
On 3/29/14 6:56 PM, Simon Sapin wrote:
> Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
> without the artificial restriction of not encoding surrogates.

http://en.wikipedia.org/wiki/CESU-8

As far as I understand, a "parse error" in the spec is meant for
conformance checkers (
On 10/03/2014 23:54, Keegan McAllister wrote:
> [...]
> Also, should we follow Gecko in representing the input stream as a
> queue of UTF-16 buffers? With UTF-8 we would have about half as much
> data to stream through the parser, and (on 64-bit) we could do
> case-insensitive ASCII operations 8 characte
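As an illustration of the eight-at-a-time point: once a buffer is known to be 7-bit ASCII, a single u64 can be case-folded with a handful of arithmetic ops. This is a standard SWAR trick, not Gecko's or Servo's actual code:

```rust
// ASCII-lowercase eight bytes at once in one u64. Precondition: every
// byte is < 0x80 (7-bit ASCII), so per-byte additions cannot carry
// into the neighboring lane. For each byte in 'A'..='Z' we set bit 5,
// which maps it to its lowercase form; other bytes are unchanged.
pub fn ascii_lower8(x: u64) -> u64 {
    const ONES: u64 = 0x0101_0101_0101_0101;
    // byte >= 'A'  <=>  high bit of (byte + 0x80 - 'A') is set
    let ge_a = x.wrapping_add(ONES * (0x80 - 0x41));
    // byte >  'Z'  <=>  high bit of (byte + 0x80 - 'Z' - 1) is set
    let gt_z = x.wrapping_add(ONES * (0x80 - 0x5B));
    // High bit marks each byte in the uppercase range.
    let is_upper = (ge_a & !gt_z) & (ONES * 0x80);
    x | (is_upper >> 2) // 0x80 >> 2 == 0x20, the ASCII case bit
}
```

The same shape of trick works for the parser's case-insensitive tag-name comparisons: OR both operands with the computed case bit and compare.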
Should we implement character encoding detection [1] at the same time as the
rest of the HTML parser? It seems to be separable; the only design
interactions I see are:
- The character decoder and script APIs can write into the same input stream
- The encoding can change during parsing [2], whic
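A very rough sketch of the second interaction (invented helper names, not the spec's actual algorithm): bytes get decoded with a tentative encoding, and a later `<meta charset=...>` that disagrees forces a re-decode from byte zero with the newly declared encoding.

```rust
/// Stand-in for the spec's meta prescan: find `charset=` and read the
/// ASCII token after it. The real prescan works on raw bytes and
/// handles quoting, whitespace, and the http-equiv form.
fn declared_charset(bytes: &[u8]) -> Option<&str> {
    let hay = std::str::from_utf8(bytes).ok()?;
    let pos = hay.find("charset=")? + "charset=".len();
    let rest = &hay[pos..];
    let end = rest
        .find(|c: char| !c.is_ascii_alphanumeric() && c != '-')
        .unwrap_or(rest.len());
    Some(&rest[..end])
}

/// If the document declares an encoding different from the tentative
/// one, a real parser would discard its output and reparse with it.
pub fn choose_encoding(bytes: &[u8], tentative: &str) -> String {
    match declared_charset(bytes) {
        Some(declared) if !declared.eq_ignore_ascii_case(tentative) => declared.to_string(),
        _ => tentative.to_string(),
    }
}
```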