> -Original Message-
> From: Feng Xue OS
> Sent: Thursday, January 14, 2021 12:28 PM
> To: gcc@gcc.gnu.org
> Cc: JiangNing OS ; Hao Liu OS
>
> Subject: [RFC] A memory gathering optimization for loop
>
> 1. Background
>
> In a loop, it is optimal if only one memory stream is activated, that is, all
> memory operations sequentially access one data region. But that is always
> not the case, such as traversing link list and manipulating discrete arrays.
> In
> this scenario, the loop would contain multiple scattered memory accesses,
> which could trigger inefficient multi-way hardware prefetching, thus making
> cpu cache hit rate drop. The tracker
> pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) gives a
> description on one related problem that we are meant to target.
>
> 2. Memory gathering optimization (MGO)
>
> For nested loops, if scattered memory accesses inside inner loop remain
> unchanged in each iteration of outer loop, we can dynamically gather result
> data into a sequential memory region when the first access in the inner loop
> takes place. This way the hardware prefetching can be reduced, and cpu
> cache hit rate can be improved. We name this optimization MGO (memory
> gathering optimization). An example is given as below, which is a little
> different from pr98598, and could not be optimized by loop distribution.
>
> struct A { int **p1; };
>
> int foo (int n, struct A *array)
> {
> int sum = 0;
>
> for (int i = 0; i < n; i++) {
> for (int j = 0; j < i; j++) { /* iteration count is i */
> int **p1 = array[j].p1;
> int *p2 = *p1;
>
> sum += *p2;
> }
> }
>
> return sum;
> }
>
> Although this example seems to be pretty artificial, the kind of data reuse in
> nested loops is often seen in real applications, especially written by
> Fortran,
> and C++ STL.
>
> To simplify explanation of this memory gathering optimization, pseudo code
> on original nested loops is given as:
>
> outer-loop ( ) { /* data in inner loop are also invariant in outer loop. */
> ...
> inner-loop (iter, iter_count) { /* inner loop to apply MGO */
>
> Type1 v1 = LOAD_FN1 (iter);
> Type2 v2 = LOAD_FN2 (v1);
> Type3 v3 = LOAD_FN3 (v2);
> ...
> iter = NEXT_FN (iter);
> }
>
> ...
> }
>
> "LOAD_FNx()" is a conceptual function that translates its argument to a
> result,
> and is composed of a sequence of operations in which there is only one
> memory load. It is capable of representing simple memory dereference like
> "iter->field", or more complicated expression like "array[iter * 2 +
> cst].field".
>
> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
> iterator is advanced. It is able to describe simple integer iterator, list
> iterator,
> or even pure-function-based iterator on advanced stuff like hashset/tree.
>
> Then the code will be transformed to:
>
> typedef struct cache_elem {
> bool init;
> Type1 c_v1;
> Type2 c_v2;
> Type3 c_v3;
> } cache_elem;
>
> cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>
> outer-loop () {
> ...
> size_t cache_idx = 0;
>
> inner-loop (iter, iter_count) {
>
> if (!cache_arr[cache_idx]->init) {
> v1 = LOAD_FN1 (iter);
> v2 = LOAD_FN2 (v1);
> v3 = LOAD_FN3 (v2);
>
> cache_arr[cache_idx]->init = true;
> cache_arr[cache_idx]->c_v1 = v1;
> cache_arr[cache_idx]->c_v2 = v2;
> cache_arr[cache_idx]->c_v3 = v3;
> }
> else {
> v1 = cache_arr[cache_idx]->c_v1;
> v2 = cache_arr[cache_idx]->c_v2;
> v3 = cache_arr[cache_idx]->c_v3;
> }
> ...
> cache_idx++;
> iter = NEXT_FN (iter);
> }
> ...
> }
>
> free (cache_arr);
The example in PR#98598 preloads the data before the outer loop, while the
transformation in this RFC lazily caches the data in the inner loop, which is to
avoid introducing unnecessary data RW hazards.
>
> In the transformation, cache space belongs to compiler-generated temporary
> storage, but it is from user heap. We find this trick is very uncommon in gcc.
> We'd like to make sure whether there are any special considerations or
> restrictions on this. Anyway, using alloca() to get space from stack is an
> alternative, but we should be cautious to avoid incurring stack overflow by
> non-user logic.
>
> "iter_count" stands for an upper bound for inner-loop iteration count, which
> must be an outer-loop invariant expression, in that it determines data count
> in cache space, and the space is allocated before outer-loop.
>
> In the example below, we can simply get such bound for inner-loop 1 from
> its niter_desc, while for inner-loop 2, it is not that eas