<snip, CppCon 2014>

> Facebook (Andrei Alexandrescu) "cheats" by "stealing bits" from their
> implementation of a shared pointer to hold the object count, with the
> understanding that the count rarely goes above 4. Correctness is
> sacrificed to increase performance 0.4%, which is a "huge win" for
> server load and electricity (saves $millions a month). "Leak, Log, and
> Leave" is what they do for the uncommon correctness failure case.
Oliver commented:

> So my first thought is that since they observed that "the reference
> count rarely goes above 4", they could restrict their counters to N
> bits only, where N < 8 (one byte).
>
> So let's say they used only 4 bits for the reference counters in their
> smart-pointer implementation. If you placed all the reference counters
> into an external "array of char" (which would somehow be globally
> accessible, without storing an extra pointer to it in each
> smart-pointer instance), then you could pack two reference counters
> into each char, halving the memory usage compared to reference
> counters implemented as a full char (byte) each (*).
>
> So you save a little (how much?) memory. Let's be generous: say a
> typical application of theirs would consume 20% less RAM with such
> "memory-optimised counters". Saving a few GB of RAM on their servers
> could indeed save some money on the electricity bill.
>
> But of course you'd need to keep track of reference counters which go
> to 0 (for re-use), or just let the "array of compressed reference
> counters" grow.
>
> The biggest issue, though, is of course leaked memory: once such a
> 4-bit reference counter gets saturated, you have to mark that smart
> pointer as "leaking"; there is no way the involved object could ever
> be safely deleted. Depending on the size of the involved objects, I
> don't think you'd see a net reduction in memory footprint, especially
> in server applications (unless you restart the server instance for
> "extensive garbage collection" every 24 hours or so ;)). Besides,
> you'd need extra state to mark the smart pointer as "leaking" (or
> you'd say that MAX_COUNT of the counter implies "leaking").
>
> So I don't think that by "stealing bits" they want to save on memory
> usage, do they? Hence the big question: how, then?
> Are there arithmetic operations such as increment/decrement which are
> faster on a couple of bits than on, say, a 32- or 64-bit integer?
> Other "bit fiddling" that one can exploit on such small values that I
> cannot think of right now?
>
> ("Parallel incrementing" of two values packed into a single value
> comes to my mind, but such SIMD instructions exist for integer (float?
> double?) values, not for char values, right? And even so: how would
> that help if you wanted to increment ONE counter in a given
> smart-pointer instance?)

My understanding is that Facebook's internal "shared_ptr<>" uses a
"tagged pointer" implementation, where it is merely a single "T*"
address, and not a (T*, int) pair for the address-and-count. See:

http://en.wikipedia.org/wiki/Tagged_pointer

So, although the pointer is 64 bits, current processors actually only
support 48 or 56 bits of address space, so there is apparently a lot of
leftover room for burgling:

https://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses

In this case, they get HUGE savings because the entire shared_ptr<> now
fits in a single 64-bit word, so it stays L1-cache-friendly. Smaller
executable size also, but the reduced L1 cache traffic is apparently
"the biggie". You can't fit a shared_ptr<> with a 64-bit pointer into
one word if it also carries a companion "int" for the ref-count, so
they removed it.

Other talks also suggested that you can similarly gain performance if
you push the most-frequently-accessed members of your object into the
first 64 bytes (because that's the first cache line loaded into the L1
cache); if you don't need the other members, you similarly "win".

But your speculation is very interesting for what the game industry
seems to be doing: they love these external (raw) vectors-of-data, and
then externally write functions that rip through that data very fast,
or in parallel, or push the whole thing to the GPU where it resides.

This is my current understanding ...
The videos will be posted in the next couple of weeks, and we can
correct my (perhaps incorrect) description.

--charley
_______________________________________________
Interest mailing list
Interest@qt-project.org
http://lists.qt-project.org/mailman/listinfo/interest