gcc-10-20210116 is now available

2021-01-16 Thread GCC Administrator via Gcc
Snapshot gcc-10-20210116 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/10-20210116/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 10 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch 
releases/gcc-10 revision 5d201f32885d0fd22db1e61394eceb0915b647af

You'll find:

 gcc-10-20210116.tar.xz   Complete GCC

  SHA256=041f4654be6819d400dbff3d77a23851dcc957d39f7435f753f2388799421cd0
  SHA1=9d07a79390335b347e2f65e5f921c4535a39fedc

Diffs from 10-20210109 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-10
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


RE: [RFC] A memory gathering optimization for loop

2021-01-16 Thread JiangNing OS via Gcc
> -Original Message-
> From: Feng Xue OS 
> Sent: Thursday, January 14, 2021 12:28 PM
> To: gcc@gcc.gnu.org
> Cc: JiangNing OS ; Hao Liu OS
> 
> Subject: [RFC] A memory gathering optimization for loop
> 
> 1. Background
> 
> In a loop, it is optimal if only one memory stream is activated, that is, all
> memory operations sequentially access one data region. But that is not always
> the case, such as when traversing a linked list or manipulating discrete
> arrays. In this scenario, the loop would contain multiple scattered memory
> accesses, which could trigger inefficient multi-way hardware prefetching,
> thus lowering the CPU cache hit rate. The tracker
> pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one
> related problem that we intend to target.
> 
> 2. Memory gathering optimization (MGO)
> 
> For nested loops, if scattered memory accesses inside the inner loop remain
> unchanged across iterations of the outer loop, we can dynamically gather the
> result data into a sequential memory region when the first access in the
> inner loop takes place. This way, multi-way hardware prefetching is reduced
> and the CPU cache hit rate is improved. We name this optimization MGO
> (memory gathering optimization). An example is given below; it differs
> slightly from pr98598 and cannot be optimized by loop distribution.
> 
>   struct A { int **p1; };
> 
>   int foo (int n, struct A *array)
>   {
>   int sum = 0;
> 
>   for (int i = 0; i < n; i++) {
>   for (int j = 0; j < i; j++) {  /* iteration count is i */
>   int **p1 = array[j].p1;
>   int  *p2 = *p1;
> 
>   sum += *p2;
>   }
>   }
> 
>   return sum;
>   }
> 
> Although this example seems pretty artificial, this kind of data reuse in
> nested loops is often seen in real applications, especially those written in
> Fortran or using the C++ STL.
> 
> To simplify the explanation of this memory gathering optimization, pseudo
> code for the original nested loops is given as:
> 
>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
>   ...
>   inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
> 
>   Type1 v1 = LOAD_FN1 (iter);
>   Type2 v2 = LOAD_FN2 (v1);
>   Type3 v3 = LOAD_FN3 (v2);
>   ...
>   iter = NEXT_FN (iter);
>   }
> 
>  ...
>   }
> 
> "LOAD_FNx()" is a conceptual function that translates its argument to a 
> result,
> and is composed of a sequence of operations in which there is only one
> memory load. It is capable of representing simple memory dereference like
> "iter->field", or more complicated expression like "array[iter * 2 + 
> cst].field".
> 
> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
> iterator is advanced. It is able to describe simple integer iterator, list 
> iterator,
> or even pure-function-based iterator on advanced stuff like hashset/tree.
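(Editorial note: as an illustration not present in the original mail, the conceptual functions map naturally onto a linked-list traversal. The `node` and `payload` names below are hypothetical, chosen only to show how LOAD_FNx and NEXT_FN correspond to real loads:)

```c
#include <stddef.h>

/* Hypothetical node type; names are illustrative only.  */
struct node {
    struct node *next;
    int *payload;
};

/* Sums *payload over a list.  Here "iter->payload" plays the role of
   LOAD_FN1 (one memory load), "*p" plays LOAD_FN2 (a second load),
   and "iter->next" is NEXT_FN advancing the iterator.  */
int sum_list (struct node *head)
{
    int sum = 0;
    for (struct node *iter = head; iter != NULL; iter = iter->next) {
        int *p = iter->payload;   /* v1 = LOAD_FN1 (iter) */
        sum += *p;                /* v2 = LOAD_FN2 (v1), consumed */
    }
    return sum;
}
```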
> 
> Then the code will be transformed to:
> 
>   typedef struct cache_elem {
>   bool   init;
>   Type1  c_v1;
>   Type2  c_v2;
>   Type3  c_v3;
>   } cache_elem;
> 
>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
> 
>   outer-loop () {
>   ...
>   size_t cache_idx = 0;
> 
>   inner-loop (iter, iter_count) {
> 
>   if (!cache_arr[cache_idx].init) {
>   v1 = LOAD_FN1 (iter);
>   v2 = LOAD_FN2 (v1);
>   v3 = LOAD_FN3 (v2);
> 
>   cache_arr[cache_idx].init = true;
>   cache_arr[cache_idx].c_v1 = v1;
>   cache_arr[cache_idx].c_v2 = v2;
>   cache_arr[cache_idx].c_v3 = v3;
>   }
>   else {
>   v1 = cache_arr[cache_idx].c_v1;
>   v2 = cache_arr[cache_idx].c_v2;
>   v3 = cache_arr[cache_idx].c_v3;
>   }
>   }
>   ...
>   cache_idx++;
>   iter = NEXT_FN (iter);
>   }
>   ...
>   }
> 
>   free (cache_arr);

The example in PR#98598 preloads the data before the outer loop, while the
transformation in this RFC lazily caches the data in the inner loop, in order
to avoid introducing unnecessary read/write data hazards.
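(Editorial note: for concreteness, here is a hand-written, runnable C sketch of what the transform would produce for the foo example above. It is illustrative, not actual compiler output; the name foo_mgo and the choice of n as the allocation bound are assumptions, since the inner loop runs at most n - 1 times:)

```c
#include <stdbool.h>
#include <stdlib.h>

struct A { int **p1; };

/* Hand-written illustration of the MGO transform applied to foo.  */
int foo_mgo (int n, struct A *array)
{
    typedef struct cache_elem {
        bool  init;
        int **c_v1;
        int  *c_v2;
        int   c_v3;
    } cache_elem;

    /* iter_count: the inner loop iterates at most n - 1 times, so n
       entries are a safe (slightly generous) bound.  */
    size_t iter_count = n > 0 ? (size_t) n : 1;
    cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
    int sum = 0;

    for (int i = 0; i < n; i++) {
        size_t cache_idx = 0;
        for (int j = 0; j < i; j++) {
            int **v1; int *v2; int v3;
            if (!cache_arr[cache_idx].init) {
                v1 = array[j].p1;              /* LOAD_FN1 */
                v2 = *v1;                      /* LOAD_FN2 */
                v3 = *v2;                      /* LOAD_FN3 */
                cache_arr[cache_idx].init = true;
                cache_arr[cache_idx].c_v1 = v1;
                cache_arr[cache_idx].c_v2 = v2;
                cache_arr[cache_idx].c_v3 = v3;
            } else {
                v1 = cache_arr[cache_idx].c_v1;
                v2 = cache_arr[cache_idx].c_v2;
                v3 = cache_arr[cache_idx].c_v3;
            }
            (void) v1; (void) v2;  /* kept to mirror the pseudo code */
            sum += v3;
            cache_idx++;
        }
    }
    free (cache_arr);
    return sum;
}
```

After the first outer iteration that reaches a given cache_idx, all three dependent loads collapse into sequential reads of cache_arr, which is the single memory stream MGO aims for.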

> 
> In the transformation, the cache space is compiler-generated temporary
> storage, but it is allocated from the user heap. We find this trick to be
> very uncommon in gcc. We'd like to make sure whether there are any special
> considerations or restrictions on this. Alternatively, alloca() could be
> used to obtain space from the stack, but we should be cautious to avoid
> incurring stack overflow through non-user logic.
> 
> "iter_count" stands for an upper bound for inner-loop iteration count, which
> must be an outer-loop invariant expression, in that it determines data count
> in cache space, and the space is allocated before outer-loop.
> 
> In the example below, we can simply get such a bound for inner-loop 1 from
> its niter_desc, while for inner-loop 2, it is not that eas