On 24/05/2025 19:56, Andy via Gcc wrote:
> Dear GCC developers,
> I would like to ask whether there might be room for improvement in memory
> access optimization in GCC.
> I've prepared a simple benchmark in both C++ (using -std=c++20 for digit
> separators like 5'000'000) and Java. The benchmark allocates a large array of
> random integers, performs a multiply-sum loop to prevent
> dead-code elimination, and measures the time taken to sort the array.
> C++ example:
When you are trying to do benchmarking, be careful about what you are
measuring. You say you are looking at "memory access optimisation", but
you are measuring the speed of std::sort(). Obviously memory accesses
will be relevant to std::sort(), but they are most certainly not the
only factor - especially when comparing to a different sort in a very
different kind of language. Some of the possible differences between
your Java and your C++ include:
1. Java and your C++ library could have significantly different sort()
algorithms. Maybe the Java implementation is better suited to sizes
like 5 million, while the C++ implementation could be prioritising very
different sizes.
2. Java's JIT compiler will target the specific processor you are using.
Without -march flags, gcc will target a generic common subset of the
architecture you are using. Thus Java can use AVX512 and whatever
other useful features your processor has - gcc will not.
Optimising code without using -march is often a waste of time -
"-march=native" can make much more of a difference than using "-O3" over
"-O2". (And sometimes "-O2" gives faster code than "-O3" - extreme
optimisation is more of an art than a science.)
3. Along the same lines, there are SIMD libraries that can give massive
improvements in sorting speed on the right processor. If the Java
implementation uses them, and the C++ implementation does not, then you
are learning absolutely nothing about memory access efficiency with gcc
- all you are learning is that these libraries are really impressive.
4. Your benchmarking is relying on luck and the limitations of the
compiler's optimiser in an attempt to measure what you want to measure.
That's never a good plan - either in C++ or in Java. You clearly have
some understanding that you need to do /something/ to stop code being
eliminated, but you are missing many other things. Again, writing good
benchmarks is hard.
gcc is a smart compiler, and gets smarter with each version. If you use
more powerful optimisation techniques - like LTO - you will get more
useful timing results of the actual work being done, and the speed of
the code for people who are using more powerful optimisation. (And if
you care a lot about the speed of a piece of code, then you should want
to use the most powerful optimisation techniques available.) But that
will also help the compiler see that much of what you are doing does
not produce any useful results - and can therefore be eliminated.
You are trying to calculate a multiply-sum value for the numbers in the
vector as a way of forcing the compiler to make the vector. But since
"sum" is never used, the compiler can eliminate it - along with all the
calculations to generate it.
You are sorting the array. But the compiler might see that you never
use the sorted array, and thus skip the sorting. Then it could see that
you don't use the vector except to sort it, and eliminate that too. A
smart enough compiler could eliminate the whole thing - leaving you
nothing but a couple of calls to now() in your loop.
Then there are the calls to the high_resolution_clock. My guess is that
the compiler can't eliminate these or re-order them with respect to each
other. But it /can/ re-order them with respect to a lot of the other code.
The key to understanding this is the concept of "observable behaviour" -
basically, things that affect the input or the output of the program,
and anything using volatile accesses. If the compiler can figure that
changing, removing or reordering code will not affect any observable
behaviour, it can make those changes in order to give you smaller or
faster results (time is /not/ "observable" in C and C++). That means it
can shuffle around a lot of code involving local variables, and
functions that it knows do not have observable behaviour.
Your two main tools here are "volatile" and inline assembly, especially
memory barriers.
If you want to say "ensure that the vector numbers is completely
calculated here", use:
volatile int vi = 0;
volatile int vx = numbers[vi];
Because "vi" is volatile, the compiler cannot assume it is still 0 by
the second line. Because "vx" is volatile, the compiler is required to
read "numbers[vi]", which could be any element of "numbers", to assign
it to "vx". It has to do this, even if "vx" is never used after that.
The other useful tool is:
asm volatile ("" ::: "memory");
This tells the compiler that during the inline assembly, anything that
was supposed to be in memory might be read and might be changed. It
therefore blocks all sorts of re-arrangements and code eliminations, and
forces values held in registers to be written back to memory. Local
variables whose address never escapes, such as loop indices, can stay
in registers and are unaffected.
None of this gives an answer as to whether gcc is generating code that
accesses memory efficiently or not. But it might help you get a clearer
picture, and that in turn might help the gcc developers find weak spots
in the compiler that can be improved.
David