On 24/05/2025 19:56, Andy via Gcc wrote:
> Dear GCC developers,
> I would like to ask whether there might be room for improvement in memory
> access optimization in GCC.
> I've prepared a simple benchmark in both C++ (using -std=c++20 for digit
> separators like 5'000'000) and Java. The benchmark allocates a large array of
> random integers, performs a multiply-sum loop to prevent
> dead-code elimination, and measures the time taken to sort the array.
> C++ example:
When you are trying to do benchmarking, be careful about what you are
measuring. You say you are looking at "memory access optimisation", but
you are measuring the speed of std::sort(). Obviously memory accesses
will be relevant to std::sort(), but they are most certainly not the
only factor - especially when comparing to a different sort in a very
different kind of language. Some of the possible differences between
your Java and your C++ include:
1. Java and your C++ library could have significantly different sort()
algorithms. Maybe the Java implementation is better suited to sizes
like 5 million, while the C++ implementation could be prioritising very
different sizes.
2. Java's JIT compiler will target the specific processor you are using.
Without -march flags, gcc will target a generic common subset of the
architecture you are using. Thus Java can use AVX512 and whatever
other useful features your processor has - gcc will not.
Optimising code without using -march is often a waste of time -
"-march=native" can make much more of a difference than using "-O3" over
"-O2". (And sometimes "-O2" gives faster code than "-O3" - extreme
optimisation is more of an art than a science.)
3. Along the same lines, there are SIMD libraries that can give massive
improvements in sorting speed on the right processor. If the Java
implementation uses them, and the C++ implementation does not, then you
are learning absolutely nothing about memory access efficiency with gcc
- all you are learning is that these libraries are really impressive.
4. Your benchmarking is relying on luck and the limitations of the
compiler's optimiser in an attempt to measure what you want to measure.
That's never a good plan - either in C++ or in Java. You clearly have
some understanding that you need to do /something/ to stop code being
eliminated, but you are missing many other things. Again, writing good
benchmarks is hard.
gcc is a smart compiler, and gets smarter with each version. If you use
more powerful optimisation techniques - like LTO - you will get more
useful timing results of the actual work being done, and the speed of
the code for people who are using more powerful optimisation. (And if
you care a lot about the speed of a piece of code, then you should want
to use the most powerful optimisation techniques available.) But that
will also help the compiler see that much of what you are doing does
not produce any useful results - and can therefore be eliminated.
You are trying to calculate a multiply-sum value for the numbers in the
vector as a way of forcing the compiler to make the vector. But since
"sum" is never used, the compiler can eliminate it - along with all the
calculations to generate it.
You are sorting the array. But the compiler might see that you never
use the sorted array, and thus skip the sorting. Then it could see that
you don't use the vector except to sort it, and eliminate that too. A
smart enough compiler could eliminate the whole thing - leaving you
nothing but a couple of calls to now() in your loop.
Then there are the calls to the high_resolution_clock. My guess is that
the compiler can't eliminate these or re-order them with respect to each
other. But it /can/ re-order them with respect to a lot of the other code.
The key to understanding this is the concept of "observable behaviour" -
basically, things that affect the input or the output of the program,
and anything using volatile accesses. If the compiler can figure that
changing, removing or reordering code will not affect any observable
behaviour, it can make those changes in order to give you smaller or
faster results (time is /not/ "observable" in C and C++). That means it
can shuffle around a lot of code involving local variables, and
functions that it knows do not have observable behaviour.
Your two main tools here are "volatile" and inline assembly, especially
memory barriers.
If you want to say "ensure that the vector numbers is completely
calculated here", use:
volatile int vi = 0;
volatile int vx = numbers[vi];
Because "vi" is volatile, the compiler cannot assume it is still 0 by
the second line. Because "vx" is volatile, the compiler is required to
read "numbers[vi]", which could be any element of "numbers", to assign
it to "vx". It has to do this, even if "vx" is never used after that.
The other useful tool is:
asm volatile ("" ::: "memory");
This tells the compiler that during the inline assembly, anything that
was supposed to be in memory might be read and might be changed. It
therefore blocks all sorts of re-arrangements and code eliminations, and
forces values held in registers to be written back to memory. Local
variables whose address never escapes, such as loop indices, can stay
in registers and are unaffected.
None of this gives an answer as to whether gcc is generating code that
accesses memory efficiently or not. But it might help you get a clearer
picture, and that in turn might help the gcc developers find weak spots
in the compiler that can be improved.
David