Re: [dev-servo] Minutes of the discussion regarding strings

2013-06-07 Thread Benjamin Smedberg

On 6/5/2013 8:42 PM, Patrick Walton wrote:
> Topics covered: Interning, mutability, cost of creating string
> objects, encoding UTF-8 versus UTF-16.
>
> https://github.com/mozilla/servo/wiki/Strings

I would love to have been invited to this meeting. Was it announced
anywhere?


I absolutely agree that we shouldn't have a separate atom type. I've 
actually been hoping that we could replace nsIAtom in gecko with a 
string flag (INTERNED) which would shortcut fast comparisons, but 
initial patches I had to test that were rotted by the work to have atoms 
store both a UTF8 and UTF16 buffer, and I never revisited it.
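
A minimal sketch of the INTERNED-flag idea, written in present-day Rust rather than Gecko's C++. The `FlaggedStr` type, the `Interner` table and their methods are hypothetical names for illustration only; none of this is an existing Gecko or Servo API. When both operands are interned, equality can be decided by comparing buffer pointers; otherwise it falls back to comparing contents:

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical string type: a single type, plus a flag recording whether
/// the buffer came from the interning table.
#[derive(Clone, Debug)]
struct FlaggedStr {
    buf: Rc<str>,
    interned: bool,
}

impl FlaggedStr {
    fn new(s: &str) -> Self {
        FlaggedStr { buf: Rc::from(s), interned: false }
    }

    /// Equality with a pointer-comparison shortcut for interned strings.
    fn fast_eq(&self, other: &FlaggedStr) -> bool {
        if self.interned && other.interned {
            // Interned strings with equal contents share one buffer.
            Rc::ptr_eq(&self.buf, &other.buf)
        } else {
            // At least one side is not interned: compare contents.
            self.buf == other.buf
        }
    }
}

/// Hypothetical single-threaded interning table.
#[derive(Default)]
struct Interner {
    table: HashSet<Rc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> FlaggedStr {
        let buf = match self.table.get(s).cloned() {
            Some(existing) => existing,
            None => {
                let fresh: Rc<str> = Rc::from(s);
                self.table.insert(Rc::clone(&fresh));
                fresh
            }
        };
        FlaggedStr { buf, interned: true }
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("div");
    let b = interner.intern("div");
    let c = interner.intern("span");
    assert!(a.fast_eq(&b)); // pointer comparison only
    assert!(!a.fast_eq(&c)); // pointer comparison only
    assert!(a.fast_eq(&FlaggedStr::new("div"))); // content comparison
}
```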



>   * Gecko has mutable strings and this is bad for performance

I'd like to understand/challenge this statement. Is this bad for 
performance because you have to check the extra SHARED flag on write? 
With auto-sharing forcing immutability, I have trouble believing this is 
a big deal in practice. The noticeable problem with auto-sharing right 
now is that it requires threadsafe refcounting, which *does* show up in 
benchmarks, but that would continue to be a problem with immutable 
strings, if they needed to be thread-shareable. Was there discussion 
about whether these strings would be at all threadsafe (and the 
interning table)?
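
For illustration, present-day Rust exposes the same trade-off as two standard-library types: `Rc` uses plain (non-atomic) reference counts and cannot cross threads, while `Arc` uses atomic counts and can. A small sketch (the function names are made up for this example); in both cases the immutable buffer is shared rather than copied:

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Share an immutable string within one thread: non-atomic refcount.
fn share_locally(text: &str) -> (Rc<str>, Rc<str>) {
    let a: Rc<str> = Rc::from(text);
    let b = Rc::clone(&a); // bumps a plain integer, no atomics
    (a, b)
}

/// Share an immutable string with another thread: atomic refcount.
fn length_on_other_thread(text: &str) -> usize {
    let shared: Arc<str> = Arc::from(text);
    let for_worker = Arc::clone(&shared); // atomic increment
    let handle = thread::spawn(move || for_worker.len());
    handle.join().expect("worker thread panicked")
}

fn main() {
    let (a, b) = share_locally("hello");
    assert!(Rc::ptr_eq(&a, &b)); // same buffer
    assert_eq!(Rc::strong_count(&a), 2);
    assert_eq!(length_on_other_thread("hello"), 5);
}
```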


My primary concern with string builders is that they typically 
reallocate when you convert the builder to an immutable string. If we 
can avoid that case by reassigning the buffer, then I think most of my 
objections go away.




  
> Cost of creating string objects
>
>   * Constructors and especially destructors are expensive
>   * No static typing
>   * JS string comes in, want to create a Gecko DependentString,
>     constructor was expensive because it had to check whether it was a
>     DependentString
>   * Would be nice to avoid hacks like that
>   * 3 cases that Gecko has: ref counted versus owned versus dependent
>     string versus null-terminated versus stack buffer
>   * Stay as simple as possible, don't add new string types unless
>     they're really necessary!

I love the sentiments here, and I share our frustration with complicated 
systems. But pretty much all of our string hacks, including JS dependent 
strings and XPCOM dependent/literal strings exist because they solved 
very noticeable performance problems. JS ropes were added in bug 571549, 
for example, which is definitely not ancient history. It's worth 
exploring whether we can remove that need by simplifying the string 
classes, but I'm very wary of generic advice to "stay as simple as 
possible" when we have prior history which indicates that simple doesn't 
perform well.


Was there discussion about whether string buffers should be refcounted 
or GCed (or copied, but I'm pretty sure that would cause memory explosion)?


--BDS

___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo


Re: [dev-servo] Minutes of the discussion regarding strings

2013-06-07 Thread Patrick Walton

On 6/7/13 6:43 AM, Benjamin Smedberg wrote:

> On 6/5/2013 8:42 PM, Patrick Walton wrote:
>> Topics covered: Interning, mutability, cost of creating string
>> objects, encoding UTF-8 versus UTF-16.
>>
>> https://github.com/mozilla/servo/wiki/Strings
>
> I would love to have been invited to this meeting. Was it announced
> anywhere?


It was kind of an impromptu thing; should have been announced more 
widely, sorry.



>>   * Gecko has mutable strings and this is bad for performance
>
> I'd like to understand/challenge this statement. Is this bad for
> performance because you have to check the extra SHARED flag on write?
> With auto-sharing forcing immutability, I have trouble believing this is
> a big deal in practice. The noticeable problem with auto-sharing right
> now is that it requires threadsafe refcounting, which *does* show up in
> benchmarks, but that would continue to be a problem with immutable
> strings, if they needed to be thread-shareable. Was there discussion
> about whether these strings would be at all threadsafe (and the
> interning table)?


The tentative conclusion was that thread safety is not needed for the
interning table, because the layout thread does not need to intern
strings (CSS parsing is handled on the script thread in Servo); it only
needs to access the contents of interned strings. If layout ever needs
to hang onto a non-static interned string across invocations, it could
just copy the string, and bz felt that such cases would be rare.



> My primary concern with string builders is that they typically
> reallocate when you convert the builder to an immutable string. If we
> can avoid that case by reassigning the buffer, then I think most of my
> objections go away.


In Rust a "string builder" is just a mutable unique string, and it won't 
reallocate when you convert it to immutable. (This is the 
"freeze"/"thaw" pattern.)
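
As a sketch of that pattern in present-day Rust syntax (which differs from the 2013 syntax; `build_greeting` is a made-up name): the string is built through a mutable binding and then moved into an immutable one. The move transfers ownership of the same heap buffer, so no reallocation or copy takes place:

```rust
/// Build a string through a mutable binding, then return it by value.
fn build_greeting(name: &str) -> String {
    let mut builder = String::with_capacity(7 + name.len());
    builder.push_str("Hello, ");
    builder.push_str(name);
    builder
}

fn main() {
    let mut thawed = build_greeting("Servo");
    thawed.push('!'); // still mutable here
    let before = thawed.as_ptr();

    let frozen = thawed; // "freeze": move into an immutable binding
    assert_eq!(frozen.as_ptr(), before); // same buffer, nothing copied
    // frozen.push('?'); // would not compile: `frozen` is not mutable

    let mut thawed_again = frozen; // "thaw": move back into a mutable binding
    thawed_again.push('?');
    assert_eq!(thawed_again, "Hello, Servo!?");
}
```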



> Was there discussion about whether string buffers should be refcounted
> or GCed (or copied, but I'm pretty sure that would cause memory explosion)?


Ref counting versus GC is determined on a case-by-case basis in Servo. 
There's no one-size-fits-all solution: we're using threadsafe reference 
counting, or unique strings, or possibly-interned strings, as the 
situation calls for it. Admittedly this is kind of a non-answer. :) I'd 
be curious as to which specific situations you had in mind.


Patrick



Re: [dev-servo] Minutes of the discussion regarding strings

2013-06-07 Thread Boris Zbarsky

On 6/7/13 9:43 AM, Benjamin Smedberg wrote:

> I would love to have been invited to this meeting. Was it announced
> anywhere?


It sort of grew out of an IRC conversation over the course of 20 mins or 
so.  :(



>>   * Gecko has mutable strings and this is bad for performance
>
> I'd like to understand/challenge this statement. Is this bad for
> performance because you have to check the extra SHARED flag on write?


My context here is things like the code I changed in 
http://hg.mozilla.org/mozilla-central/rev/e5cc69819435 which was in 
theory trying to follow best practices (calling SetCapacity, etc).  But 
then it also did appends one character at a time and each append had to 
do a bunch of "compute the new capacity and see whether we have that 
much", which was fairly expensive on the scale of the overall timing on 
that microbenchmark.  That initial SetCapacity call actually made things 
worse, not better, in common cases, by being an expensive no-op.


Basically, every time you go to append to an nsAString it goes through a 
_lot_ of code to make sure that it can end up just appending your stuff. 
 Maybe that's just an implementation problem with XPCOM strings, not a 
general issue with mutable strings, of course.
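
To make the shape of that cost concrete, here is a sketch in Rust rather than XPCOM C++, so it illustrates the pattern and not the actual nsAString code paths (both function names are invented). The first function reserves capacity up front but then appends one character at a time, so every append still performs its own capacity check; the second hands the whole slice to a single append, which checks once:

```rust
/// Reserve up front, then append one character at a time.
/// Each `push` re-checks capacity even though it was already reserved.
fn append_per_char(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        out.push(ch);
    }
    out
}

/// Append the whole slice in one call: one capacity check, one copy.
fn append_bulk(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    out.push_str(input);
    out
}

fn main() {
    let text = "some attribute value";
    // Both strategies produce the same string; only the bookkeeping differs.
    assert_eq!(append_per_char(text), append_bulk(text));
}
```

Whether the difference is measurable depends on the string implementation and the workload; this sketch does not reproduce the microbenchmark mentioned above.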


And if you want to insert it's even more expensive.  On the other hand, 
inserts into an existing string, are not exactly very easy in a 
stringbuilder setup...



> I love the sentiments here, and I share our frustration with complicated
> systems. But pretty much all of our string hacks, including JS dependent
> strings and XPCOM dependent/literal strings exist because they solved
> very noticeable performance problems. JS ropes were added in bug 571549,
> for example, which is definitely not ancient history. It's worth
> exploring whether we can remove that need by simplifying the string
> classes, but I'm very wary of generic advice to "stay as simple as
> possible" when we have prior history which indicates that simple doesn't
> perform well.


I think the problem is that the complex thing also does not perform well 
in many cases.  See FakeDependentString and the DOMString struct in 
Gecko which try to work around by not creating an XPCOM string at all...


The question is whether these cases should just be special goop that 
bindings do and DOM code knows about or whether the "normal" string 
types can serve that need.



> Was there discussion about whether string buffers should be refcounted
> or GCed


There wasn't, no.  It's an interesting question.  There has been talk 
over on the JS side to allow refcounting their strings, which would also 
be very interesting since it would allow the rendering engine to share 
their buffers as long as the encoding agrees.


-Boris



Re: [dev-servo] Minutes of the discussion regarding strings

2013-06-07 Thread Benjamin Smedberg

On 6/7/2013 10:03 AM, Patrick Walton wrote:





>> Was there discussion about whether string buffers should be refcounted
>> or GCed (or copied, but I'm pretty sure that would cause memory
>> explosion)?
>
> Ref counting versus GC is determined on a case-by-case basis in Servo.
> There's no one-size-fits-all solution: we're using threadsafe
> reference counting, or unique strings, or possibly-interned strings,
> as the situation calls for it. Admittedly this is kind of a
> non-answer. :) I'd be curious as to which specific situations you had
> in mind.


Well... I don't think I understand the answer yet.

If strings are immutable, you have two basic options for passing them 
around:


* You can pass pointers to the actual string objects around. This 
requires that pretty much all code has a shared understanding of whether 
the objects are refcounted or GCed, I think. You *may* have the 
possibility to allocate both the "string object" and its backing buffer 
in a single allocation, which potentially saves memory. This is the 
pattern in the JS engine.


* String objects are lightweight (flags + pointer to buffer). String 
assignment just shares the buffer. This is the current XPCOM pattern. In 
this case, the actual string objects could be inline or separately 
allocated. But then the question is really about the buffers: would it 
make more sense to GC or refcount them? Note that if strings are 
mutable, then this is really the only sane way to pass strings around, 
since you use copy-on-write semantics for the buffers.
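
A minimal sketch of the second option in Rust, assuming a single-threaded, refcounted buffer (the `Handle` type and its methods are hypothetical, not XPCOM or Servo code). Assignment shares the buffer; a write copies it first only if it is still shared, which is the copy-on-write behaviour described above:

```rust
use std::rc::Rc;

/// Lightweight string object: just a pointer to a refcounted buffer.
#[derive(Clone)]
struct Handle {
    buf: Rc<String>,
}

impl Handle {
    fn new(s: &str) -> Self {
        Handle { buf: Rc::new(s.to_owned()) }
    }

    fn as_str(&self) -> &str {
        self.buf.as_str()
    }

    /// Copy-on-write append: `Rc::make_mut` clones the buffer only when
    /// another handle still points at it.
    fn append(&mut self, more: &str) {
        Rc::make_mut(&mut self.buf).push_str(more);
    }

    fn shares_buffer_with(&self, other: &Handle) -> bool {
        Rc::ptr_eq(&self.buf, &other.buf)
    }
}

fn main() {
    let a = Handle::new("abc");
    let mut b = a.clone(); // assignment: shares the buffer
    assert!(a.shares_buffer_with(&b));

    b.append("def"); // write: b gets its own copy first
    assert!(!a.shares_buffer_with(&b));
    assert_eq!(a.as_str(), "abc");
    assert_eq!(b.as_str(), "abcdef");
}
```

For comparison, an `Rc<str>` keeps the reference counts and the characters in one allocation, which is closer in spirit to the first option.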


Currently in XPCOM, buffers use threadsafe refcounting because we do 
pass strings between threads (mainly in networking-land). If we are sure 
that these strings are not going to be assigned across threads, we 
should probably just use non-threadsafe refcounting for the buffers.


I haven't spent a lot of time understanding the Rust or Servo
architecture other than reading the occasional traffic on this list (is
there a guide doc now?). I understand that layout operates on a separate
task, but it wasn't clear to me what kinds of structures are sent to
that task or how they are synchronized.


--BDS



Re: [dev-servo] Minutes of the discussion regarding strings

2013-06-07 Thread Benjamin Smedberg

On 6/7/2013 11:27 AM, Boris Zbarsky wrote:


> My context here is things like the code I changed in
> http://hg.mozilla.org/mozilla-central/rev/e5cc69819435 which was in
> theory trying to follow best practices (calling SetCapacity, etc).
> But then it also did appends one character at a time and each append
> had to do a bunch of "compute the new capacity and see whether we have
> that much", which was fairly expensive on the scale of the overall
> timing on that microbenchmark.  That initial SetCapacity call actually
> made things worse, not better, in common cases, by being an expensive
> no-op.
>
> Basically, every time you go to append to an nsAString it goes through
> a _lot_ of code to make sure that it can end up just appending your
> stuff.  Maybe that's just an implementation problem with XPCOM
> strings, not a general issue with mutable strings, of course.
>
> And if you want to insert it's even more expensive.  On the other
> hand, inserts into an existing string, are not exactly very easy in a
> stringbuilder setup...

Ah yeah, that's not a good pattern in general. Ideally mutating a string
in XPCOM should be:

* call BeginWriting with the desired length, get a buffer pointer
* manipulate that buffer pointer
* when you're done, call SetLength to the correct final length
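
The three steps above amount to a generic "reserve, write in place, trim" pattern. A sketch of that pattern in Rust, not the XPCOM API (`upper_without_spaces` and its buffer handling are invented for illustration): size the buffer once for the worst case, write through it directly with no reallocation logic per write, then set the final length:

```rust
/// Uppercase ASCII letters and drop spaces, writing into a pre-sized buffer.
fn upper_without_spaces(input: &str) -> String {
    // Step 1: get a writable buffer of the desired (maximum) length.
    let mut buf = vec![0u8; input.len()];

    // Step 2: manipulate the buffer directly.
    let mut written = 0;
    for &byte in input.as_bytes() {
        if byte != b' ' {
            buf[written] = byte.to_ascii_uppercase();
            written += 1;
        }
    }

    // Step 3: set the correct final length.
    buf.truncate(written);
    // Removing ASCII spaces and uppercasing ASCII bytes leaves any
    // multi-byte UTF-8 sequences untouched, so this cannot fail.
    String::from_utf8(buf).expect("output is still valid UTF-8")
}

fn main() {
    assert_eq!(upper_without_spaces("a b c"), "ABC");
}
```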

I agree that if we can get efficient stringbuilder behavior, immutable
strings are in general better, and something we should aim for.




> I think the problem is that the complex thing also does not perform
> well in many cases.  See FakeDependentString and the DOMString struct
> in Gecko which try to work around by not creating an XPCOM string at
> all...
>
> The question is whether these cases should just be special goop that
> bindings do and DOM code knows about or whether the "normal" string
> types can serve that need.

I wasn't aware of these; I'll follow up with you about them separately
since it may be just that our string code is doing something dumb.


--BDS



Re: [dev-servo] Minutes of the discussion regarding strings

2013-06-07 Thread Nicholas Nethercote

On Thu, Jun 6, 2013 at 11:09 AM, Robert O'Callahan wrote:
> Immutable strings sound good to me.
>
> How hard would it be to add UTF-8 strings to Spidermonkey? JSString already
> has a lot of "modes", perhaps one more wouldn't hurt :-). I'm imagining
> that for anything that required UTF-16 (charAt etc) you'd convert the
> string internals to UTF-16, and for passing into WebIDL we'd convert string
> internals to UTF-8.

I asked this a while ago and Waldo said it would be very difficult,
but I can't remember why.  I've CC'd him... (and Terrence, who has
been doing some string-related changes).
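
For concreteness, the idea quoted above could be sketched as follows. This is a toy in Rust, not SpiderMonkey code; `DualStr` and its methods are invented, and the sketch says nothing about why the real change would be difficult. The string holds one representation at a time and converts its internals when an operation needs the other encoding:

```rust
/// Toy string with two possible internal encodings.
enum DualStr {
    Utf8(String),
    Utf16(Vec<u16>),
}

impl DualStr {
    /// `charAt`-style access needs UTF-16: convert internals if necessary.
    fn code_unit_at(&mut self, index: usize) -> Option<u16> {
        if let DualStr::Utf8(s) = self {
            let units: Vec<u16> = s.encode_utf16().collect();
            *self = DualStr::Utf16(units);
        }
        match self {
            DualStr::Utf16(units) => units.get(index).copied(),
            DualStr::Utf8(_) => unreachable!("converted to UTF-16 above"),
        }
    }

    /// Passing to a UTF-8 consumer: convert internals if necessary.
    fn as_utf8(&mut self) -> &str {
        if let DualStr::Utf16(units) = self {
            // Lossy conversion: unpaired surrogates become U+FFFD.
            let s = String::from_utf16_lossy(units);
            *self = DualStr::Utf8(s);
        }
        match self {
            DualStr::Utf8(s) => s.as_str(),
            DualStr::Utf16(_) => unreachable!("converted to UTF-8 above"),
        }
    }
}

fn main() {
    let mut s = DualStr::Utf8("héllo".to_string());
    assert_eq!(s.code_unit_at(1), Some(0x00E9)); // now stored as UTF-16
    assert_eq!(s.as_utf8(), "héllo"); // now stored as UTF-8 again
}
```

Note that a string used alternately through both interfaces would be converted back and forth on every switch in this toy version.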

Nick