On 08/30/2018 11:21 AM, Henri Sivonen wrote:
We have the following that a pattern in our code base:

  1) SetCapacity(newCapacity) is called on an XPCOM string.
  2) A pointer obtained from BeginWriting() is used for writing more
than Length() but no more than newCapacity code units to the XPCOM
string.
  3) SetLength(actuallyWritten)  is called on the XPCOM string such
that actuallyWritten > Length() but actuallyWritten <= newCapacity.

This is an encapsulation violation, because the string implementation
doesn't know that the content past Length() is there.
How so? The whole point of capacity is that the string has that much capacity.



 The caller
assumes that step #3 will not reallocate and will only write a
zero-terminator at actuallyWritten and set mLength to actuallyWritten.
(The pattern is common enough that clearly people have believed it to
be a valid pattern. However, I haven't seen any in-tree or on-wiki
string documentation endorsing the pattern.)

It should be non-controversial that this is an encapsulation
violation,
Well, I'm not seeing any encapsulation violation ;)


but does the encapsulation violation matter? It matters if
we want SetLength() to be able to conserve memory by allocating a
smaller buffer when actuallyWritten code units would fit in a smaller
mozjemalloc bucket.

Please be very very careful when doing allocations and deallocations. They are 
very slow, showing
up all the time in performance profiles.


 In order for the above pattern to work if
SetLength() can reallocate in such case, SetLength() has to memcpy the
whole buffer in case someone has written content that the string
implementation is unaware of instead of just memcpying the part of the
buffer that the string implementation knows to be in use. Pessimizing
the number of code units to memcpy is bad for performance.

It's unclear if trying to use a smaller mozjemalloc bucket it is a
worthwhile thing. It obviously is for large long-lived strings and it
obviously isn't for short-lived strings even if they are large.
SetLength() doesn't know what the future holds for the string. :-( But
in any case, it's bad if we can't make changes that are sound from the
perspective of the string encapsulation, because we have other code
violating the encapsulation.

After the soft freeze, I'd like to change things so that we memcpy
only the part of the buffer that the string implementation knows is in
use. To that end, we should stop using the above pattern that is an
encapsulation violation.

What is then the point of SetCapacity anymore?


For m-c, I've filed
https://bugzilla.mozilla.org/show_bug.cgi?id=1472113 and worked on
fixing the bugs that block it. For c-c, I've filed
https://bugzilla.mozilla.org/show_bug.cgi?id=1486706 but don't intend
to do the work of investigating or fixing the string usage in c-c.

As for fixing the above pattern, there are two alternatives. The first one is:

  1) SetLength(newCapacity)
  2) BeginWriting()
  3) Truncate(actuallyWritten) (or SetLength(actuallyWritten), Truncate
simply tells to the reader that the string isn't being made longer)

With this pattern, writing happens to the part of the buffer that the
string implementation believes to be in use. This has the downside
that the first SetLength() call (like, counter-intuitively,
SetCapacity() currently!) writes the zero terminator, which from the
point of view of CPU caches is an out-of-the-blue write that's not
part of a well-behaved forward-only linear write pattern and not
necessarily near recently-accessed locations.

The second alternative is BulkWrite() in C++ and bulk_write() in Rust.
The API doesn't seem to be too user friendly.


This is new API that is well-behaved in terms of the cache access
pattern and is also more versatile in the sense that it lets the
caller know how newCapacity was rounded up, which is relevant to
callers that ask for best-case capacity and then ask more capacity if
there turns out to be more to write. When the caller is made aware of
the rounding, a second request for added capacity can be avoided if
the amount that actually needs to be written exceeds the best case
estimate but fits within the rounded-up capacity.

In Rust, bulk_write() is rather nicely misuse-resistant.  However, on
the C++ side the lack of a borrow checker as well as mozilla::Result
not working with move-only types
(https://bugzilla.mozilla.org/show_bug.cgi?id=1418624) pushes more
things to documentation. The documentation can be found at
https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Internal_strings#Bulk_Write
https://searchfox.org/mozilla-central/rev/2fe43133dbc774dda668d4597174d73e3969181a/xpcom/string/nsTSubstring.h#1190
https://searchfox.org/mozilla-central/rev/2fe43133dbc774dda668d4597174d73e3969181a/xpcom/string/nsTSubstring.h#32

P.S. GetMutableData() is redundant with BeginWriting() and
SetLength(). It's used very rarely and I'd like to remove it as
redundant, so please don't use GetMutableData().


_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform

Reply via email to