On 24/05/2025 19:56, Andy via Gcc wrote:
Dear GCC developers,

I would like to ask whether there might be room for improvement in memory
access optimization in GCC.

I've prepared a simple benchmark in both C++ (using -std=c++20 for digit
separators like 5'000'000) and Java. The benchmark allocates a large array of
random integers, performs a multiply-sum loop to prevent
dead-code elimination, and measures the time taken to sort the array.
C++ example:


When you are benchmarking, be careful about what you are actually measuring. You say you are looking at "memory access optimisation", but what you are measuring is the speed of std::sort(). Memory accesses are obviously relevant to std::sort(), but they are certainly not the only factor - especially when you compare against a different sort in a very different kind of language. Possible differences between your Java and your C++ include:

1. Java and your C++ library could have significantly different sort() algorithms. Maybe the Java implementation is better suited to sizes like 5 million elements, while the C++ implementation is tuned for very different sizes.

2. Java's JIT compiler targets the specific processor you are running on. Without -march flags, gcc targets a generic common subset of the architecture. So Java can use AVX-512 and whatever other useful features your processor might have - gcc will not. Optimising code without -march is often a waste of time: "-march=native" can make much more of a difference than using "-O3" over "-O2". (And sometimes "-O2" gives faster code than "-O3" - extreme optimisation is more of an art than a science.)
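As an illustration (the file name "bench.cpp" and the output names are hypothetical), the difference is easy to try for yourself:

```shell
# Generic baseline: targets a lowest-common-denominator subset of the ISA.
g++ -O2 -std=c++20 bench.cpp -o bench_generic

# Let gcc use every instruction set the build machine supports
# (AVX2/AVX-512 where available). Often a bigger win than -O3 alone.
g++ -O2 -march=native -std=c++20 bench.cpp -o bench_native

# Worth timing as well - -O3 is not always faster than -O2.
g++ -O3 -march=native -std=c++20 bench.cpp -o bench_o3
```

Time all three binaries on the same input before drawing any conclusions.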

3. Along the same lines, there are SIMD libraries that can give massive improvements in sorting speed on the right processor. If the Java implementation uses one and the C++ implementation does not, then you are learning absolutely nothing about memory access efficiency with gcc - all you are learning is that these libraries are really impressive.

4. Your benchmark relies on luck and on the limitations of the compiler's optimiser to measure what you want to measure. That is never a good plan - in C++ or in Java. You clearly understand that you need to do /something/ to stop code being eliminated, but you are missing many other things. Writing good benchmarks is hard.

gcc is a smart compiler, and gets smarter with each version. If you use more powerful optimisation techniques - like LTO - your timings will better reflect the actual work being done, and the speed of the code as seen by people who build with those options. (And if you care a lot about the speed of a piece of code, you should want the most powerful optimisation techniques available.) But LTO will also help the compiler see that much of what your benchmark does produces no useful result - and can therefore be eliminated.
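A minimal sketch of an LTO build (again, "bench.cpp" is a hypothetical file name) - note that -flto must be passed at both the compile and the link step:

```shell
# With whole-program knowledge gcc can inline across translation units
# and delete code whose results are never observed.
g++ -O2 -march=native -flto -c bench.cpp -o bench.o
g++ -O2 -march=native -flto bench.o -o bench_lto
```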

You are trying to calculate a multiply-sum value for the numbers in the vector as a way of forcing the compiler to materialise the vector. But since "sum" is never used, the compiler can eliminate it - along with all the calculations that generate it.

You are sorting the array. But the compiler might see that you never use the sorted array, and thus skip the sorting. Then it could see that you don't use the vector except to sort it, and eliminate that too. A smart enough compiler could eliminate the whole thing - leaving you nothing but a couple of calls to now() in your loop.

Then there are the calls to the high_resolution_clock. My guess is that the compiler can't eliminate these or re-order them with respect to each other. But it /can/ re-order them with respect to a lot of the other code.

The key to understanding this is the concept of "observable behaviour" - basically, anything that affects the input or the output of the program, plus anything using volatile accesses. If the compiler can prove that changing, removing or reordering code will not affect any observable behaviour, it may make those changes in order to give you smaller or faster results (elapsed time is /not/ "observable" in C and C++). That means it can shuffle around a lot of code involving local variables, and calls to functions that it knows have no observable behaviour.

Your two main tools here are "volatile" and inline assembly, especially memory barriers.

If you want to say "ensure that the vector "numbers" is completely calculated here", use:

        volatile int vi = 0;
        volatile int vx = numbers[vi];

Because "vi" is volatile, the compiler cannot assume it is still 0 by the second line. Because "vx" is volatile, the compiler is required to read "numbers[vi]", which could be any element of "numbers", to assign it to "vx". It has to do this, even if "vx" is never used after that.

The other useful tool is:

        asm volatile ("" ::: "memory");

This tells the compiler that, during the inline assembly, anything that was supposed to be in memory might be read and might be changed. It therefore blocks all sorts of re-arrangements and code eliminations, and forces any memory-resident data currently held in registers to be written back. Local variables whose address is never taken - such as typical loop indices - can live entirely in registers and are unaffected.


None of this gives an answer as to whether gcc is generating code that accesses memory efficiently or not. But it might help you get a clearer picture, and that in turn might help the gcc developers find weak spots in the compiler that can be improved.

David

