Re: Possible memory access optimization opportunity? Comparison with Java JIT

2025-05-25 Thread David Brown via Gcc

On 24/05/2025 19:56, Andy via Gcc wrote:

Dear GCC developers,

I would like to ask whether there might be room for improvement in memory
access optimization in GCC.

I've prepared a simple benchmark in both C++ (using -std=c++20 for digit
separators like 5'000'000) and Java. The benchmark allocates a large array of
random integers, performs a multiply-sum loop to prevent
dead-code elimination, and measures the time taken to sort the array.
C++ example:



When you are trying to do benchmarking, be careful about what you are 
measuring.  You say you are looking at "memory access optimisation", but 
you are measuring the speed of std::sort().  Obviously memory accesses 
will be relevant to std::sort(), but they are most certainly not the 
only factor - especially when comparing to a different sort in a very 
different kind of language.  Some of the possible differences between 
your Java and your C++ include:


1. Java and your C++ library could have significantly different sort() 
algorithms.  Maybe the Java implementation is better suited to sizes 
like 5 million, while the C++ implementation could be prioritising very 
different sizes.


2. Java's JIT compiler will target the specific processor you are using. 
 Without -march flags, gcc will target a generic common subset of the 
architecture you are using.  Thus Java will use AVX512 and whatever 
other useful features your processor might have - gcc will not. 
Optimising code without using -march is often a waste of time - 
"-march=native" can make much more of a difference than using "-O3" over 
"-O2".  (And sometimes "-O2" gives faster code than "-O3" - extreme 
optimisation is more of an art than a science.)


3. Along the same lines, there are SIMD libraries that can give massive 
improvements in sorting speed on the right processor.  If the Java 
implementation uses them, and the C++ implementation does not, then you 
are learning absolutely nothing about memory access efficiency with gcc 
- all you are learning is that these libraries are really impressive.


4. Your benchmarking is relying on luck and the limitations of the 
compiler's optimiser in an attempt to measure what you want to measure. 
That's never a good plan - either in C++ or in Java.  You clearly have 
some understanding that you need to do /something/ to stop code being 
eliminated, but you are missing many other things.  Again, writing good 
benchmarks is hard.


gcc is a smart compiler, and gets smarter with each version.  If you use 
more powerful optimisation techniques - like LTO - you will get more 
useful timing results of the actual work being done, and the speed of 
the code for people who are using more powerful optimisation.  (And if 
you care a lot about the speed of a piece of code, then you should want 
to use the most powerful optimisation techniques available.)  But that 
will also help the compiler see that much of what you are doing does 
not produce any useful results - and can therefore be eliminated.


You are trying to calculate a multiply-sum value for the numbers in the 
vector as a way of forcing the compiler to make the vector.  But since 
"sum" is never used, the compiler can eliminate it - along with all the 
calculations to generate it.


You are sorting the array.  But the compiler might see that you never 
use the sorted array, and thus skip the sorting.  Then it could see that 
you don't use the vector except to sort it, and eliminate that too.  A 
smart enough compiler could eliminate the whole thing - leaving you 
nothing but a couple of calls to now() in your loop.


Then there are the calls to the high_resolution_clock.  My guess is that 
the compiler can't eliminate these or re-order them with respect to each 
other.  But it /can/ re-order them with respect to a lot of the other code.


The key to understanding this is the concept of "observable behaviour" - 
basically, things that affect the input or the output of the program, 
and anything using volatile accesses.  If the compiler can figure that 
changing, removing or reordering code will not affect any observable 
behaviour, it can make those changes in order to give you smaller or 
faster results (time is /not/ "observable" in C and C++).  That means it 
can shuffle around a lot of code involving local variables, and 
functions that it knows do not have observable behaviour.


Your two main tools here are "volatile" and inline assembly, especially 
memory barriers.


If you want to say "ensure that the vector numbers is completely 
calculated here", use :


volatile int vi = 0;
volatile int vx = numbers[vi];

Because "vi" is volatile, the compiler cannot assume it is still 0 by 
the second line.  Because "vx" is volatile, the compiler is required to 
read "numbers[vi]", which could be any element of "numbers", to assign 
it to "vx".  It has to do this, even if "vx" is never used after that.


The other useful tool is :

asm volatile ("" ::: "memory");

This tells the compiler that the (empty) asm statement may read or modify 
any memory.  The compiler therefore cannot keep values cached in registers 
across it, and cannot move memory reads or writes past it in either 
direction - it acts as a compiler-level memory barrier, while emitting no 
instructions at all.

gcc-16-20250525 is now available

2025-05-25 Thread GCC Administrator via Gcc
Snapshot gcc-16-20250525 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/16-20250525/
and on various mirrors, see https://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 16 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch master 
revision 2159f024f63c12fd356748ae8fc106bb9b355688

You'll find:

 gcc-16-20250525.tar.xz   Complete GCC

  SHA256=bca61e1bbb5844b86aa5297236c16eb23a938b132330fa8eaf3c78eeb035549e
  SHA1=2046affedbe8c40dc37336bc1be64aa6b9a5b28c

Diffs from 16-20250518 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-16
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.