> But, your speculation is very interesting for what the Game industry seems to 
> be doing -- they love these external (raw) vectors-of-data, and then 
> externally write functions that rip-through that data very fast, or in 
> parallel, or push the whole thing to the GPU where it resides.

The important part indeed is to store a vector of objects, not a vector of 
pointers to objects. This is an idea that has really been pushed quite hard for 
a while now.

For example, check out the blog post here: 
http://bannalia.blogspot.fi/2014/05/fast-polymorphic-collections.html Note the 
graphs at the end - the size of the container does not impact per-element 
iteration time once you stop storing pointers to objects.

It's the same thing Herb Sutter talks about here: 
http://channel9.msdn.com/Events/Build/2014/2-661 (starting at around 23-24 
minutes in). The talk is geared very much towards C++ newbies, but the numbers 
that he presents are quite interesting indeed.

-- Louai

________________________________________
From: interest-bounces+louai.al-khanji=digia....@qt-project.org 
[interest-bounces+louai.al-khanji=digia....@qt-project.org] on behalf of 
charleyb123 . [charleyb...@gmail.com]
Sent: Monday, September 15, 2014 9:55 PM
To: Till Oliver Knoll
Cc: Qt Project
Subject: Re: [Interest] CppCon Just Ended (next year will be 20-25 Sep 2015)

<snip, CppCon 2014>
> *- Facebook (Andrei Alexandrescu) "cheats" with "stealing bits" from their 
> implementation of a "shared-pointer" to "steal-bits" for the "object-count", 
> with the understanding that the count rarely goes above "4".  Correctness is 
> sacrificed to increase performance 0.4%, which is a "huge win" for server 
> load and electricity (saves $millions a month).  "Leak, Log, and Leave" is 
> what they do for the uncommon correctness failure-case.

Oliver commented:

So my first thought is that, since they observed that "the reference count 
rarely goes above 4", they restrict their counters to N bits only, where N < 8 
(less than a byte).

So let's say they used only 4 bits per reference counter in their "smart 
pointer" implementation. If you placed all the reference counters into an 
external "array of char" (which would - somehow - be "globally accessible" 
without storing an extra pointer to it in each smart pointer instance), then 
you could fit two reference counters into each char, reducing memory usage by 
a factor of 2 compared to reference counters implemented as a full "char" 
(byte) each (*).

So you save a little (how much?) memory. Let's be generous: let's say a typical 
application of theirs would consume 20% less RAM with such "memory optimised 
counters". Saving a few GB of RAM on their servers could probably indeed save 
you some bucks on the electricity bill.


But of course you'd need to keep track of reference counters which drop to 0 
(-> "re-use") - or you just let the "array of compressed reference counters" grow.

But then the biggest issue is of course leaked memory: once such a "4 bit 
reference counter" gets "saturated", you have to "mark" that smart pointer as 
"is leaking": there is no way the involved object could ever be safely deleted. 
Depending on the size of the involved objects I don't think you'd have a net 
reduction in memory footprint, especially in server applications (unless you 
restart the server instance for "extensive garbage collection" every 24 hours 
or so ;)). Besides, you'd need extra state to mark the smart pointer as 
"leaking" (or you'd say that MAX_COUNT of the counter implies "leaking").


So I don't think by "stealing bits" they want to save on memory usage, do they?

Hence the big question: how then? Are there arithmetic operations such as 
increment/decrement which are faster on a couple of bits than on, say, an 
integer (32 or 64 bit)? Some other "bit fiddling" that one can exploit on such 
small values that I cannot think of right now?

("Parallel incrementing" of two values packed into a single value comes to 
mind - but such SIMD instructions exist for integers (floats? doubles?), and 
not for char values, right? And even so: how would that help if you wanted to 
increment ONE counter in a given smart pointer instance?)

My understanding is that Facebook's internal "shared_ptr<>" uses a "Tagged 
Pointer" implementation, where it is merely a single "T*" address, and not a 
(T*,int) for the address-and-count.  See:  
http://en.wikipedia.org/wiki/Tagged_pointer

So, although the pointer is 64 bits wide, current processors actually only use 
48 bits of virtual address (57 with the newer 5-level page tables), so there is 
apparently a lot of leftover room for burgling:  
https://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses

In this case, they get HUGE savings because the entire shared_ptr<> now fits 
in a single 64-bit word, so it is far friendlier to the L1 cache.  Smaller 
executable size also, but avoiding L1 cache misses is apparently "the biggie".  
A shared_ptr<> carrying both a 64-bit pointer and a companion "int" for the 
"ref-count" no longer fits in one word, so they removed the separate count.

Other talks also suggested that you can similarly gain performance if you push 
the "most-frequently-accessed-members" of your object into the first 64 bytes 
(because that's one cache line, the unit loaded into the L1 cache), and if you 
don't need the other members, you similarly "win".

But, your speculation is very interesting for what the Game industry seems to 
be doing -- they love these external (raw) vectors-of-data, and then externally 
write functions that rip-through that data very fast, or in parallel, or push 
the whole thing to the GPU where it resides.

This is my current understanding ... the videos will be posted in the next 
couple weeks, and we can correct my (perhaps incorrect) description.

--charley
_______________________________________________
Interest mailing list
Interest@qt-project.org
http://lists.qt-project.org/mailman/listinfo/interest