RE: [RFC] A memory gathering optimization for loop
On January 17, 2021 2:23:55 AM GMT+01:00, JiangNing OS via Gcc wrote:
>> -----Original Message-----
>> From: Feng Xue OS
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS ; Hao Liu OS
>> Subject: [RFC] A memory gathering optimization for loop
>>
>> 1. Background
>>
>> In a loop, it is optimal if only one memory stream is activated, that is, all memory operations sequentially access one data region. But that is not always the case, such as when traversing a linked list or manipulating discrete arrays. In this scenario, the loop contains multiple scattered memory accesses, which could trigger inefficient multi-way hardware prefetching and thus make the cpu cache hit rate drop. The tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one related problem that we mean to target.
>>
>> 2. Memory gathering optimization (MGO)
>>
>> For nested loops, if scattered memory accesses inside the inner loop remain unchanged in each iteration of the outer loop, we can dynamically gather the result data into a sequential memory region when the first access in the inner loop takes place. This way hardware prefetching can be reduced, and the cpu cache hit rate can be improved. We name this optimization MGO (memory gathering optimization). An example is given below; it is a little different from pr98598 and could not be optimized by loop distribution.
>>
>> struct A { int **p1; };
>>
>> int foo (int n, struct A *array)
>> {
>>   int sum = 0;
>>
>>   for (int i = 0; i < n; i++) {
>>     for (int j = 0; j < i; j++) {   /* iteration count is i */
>>       int **p1 = array[j].p1;
>>       int *p2 = *p1;
>>
>>       sum += *p2;
>>     }
>>   }
>>
>>   return sum;
>> }

The number of memory streams in this loop is limited (practically 1); how would the transform be affected when issuing prefetches for array at a well placed spot?

That is, what's the hazard we avoid (besides the obvious data dependence shortening in case the load from array is in cache)?

Richard.

>> Although this example seems to be pretty artificial, this kind of data reuse in nested loops is often seen in real applications, especially those written in Fortran or using the C++ STL.
>>
>> To simplify the explanation of this memory gathering optimization, pseudo code for the original nested loops is given as:
>>
>> outer-loop ( ) {   /* data in inner loop are also invariant in outer loop. */
>>   ...
>>   inner-loop (iter, iter_count) {   /* inner loop to apply MGO */
>>
>>     Type1 v1 = LOAD_FN1 (iter);
>>     Type2 v2 = LOAD_FN2 (v1);
>>     Type3 v3 = LOAD_FN3 (v2);
>>     ...
>>     iter = NEXT_FN (iter);
>>   }
>>   ...
>> }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a result, and is composed of a sequence of operations in which there is only one memory load. It is capable of representing a simple memory dereference like "iter->field", or a more complicated expression like "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an iterator is advanced. It is able to describe a simple integer iterator, a list iterator, or even a pure-function-based iterator over advanced structures like a hashset/tree.
>>
>> Then the code will be transformed to:
>>
>> typedef struct cache_elem {
>>   bool  init;
>>   Type1 c_v1;
>>   Type2 c_v2;
>>   Type3 c_v3;
>> } cache_elem;
>>
>> cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>> outer-loop () {
>>   ...
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count) {
>>
>>     if (!cache_arr[cache_idx]->init) {
>>       v1 = LOAD_FN1 (iter);
>>       v2 = LOAD_FN2 (v1);
>>       v3 = LOAD_FN3 (v2);
>>
>>       cache_arr[cache_idx]->init = true;
>>       cache_arr[cache_idx]->c_v1 = v1;
>>       cache_arr[cache_idx]->c_v2 = v2;
>>       cache_arr[cache_idx]->c_v3 = v3;
>>     }
>>     else {
>>       v1 = cache_arr[cache_idx]->c_v1;
>>       v2 = cache_arr[cache_idx]->c_v2;
>>       v3 = cache_arr[cache_idx]->c_v3;
>>     }
>>     ...
>>     cache_idx++;
>>     iter = NEXT_FN (iter);
>>   }
>>   ...
>> }
>>
>> free (cache_arr);
>
> The example in PR#98598 preloads the data before the outer loop, while the transformation in this RFC lazily caches the data in the inner loop, which is to avoid introducing unnecessary data RW hazards.
>
>> In the transformation, cache space belongs to compiler-generated temporary storage, but it is from user heap. We find this trick is very uncommon in gcc. We'd like to make sure whether there are any special considerations or restrictions on
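For concreteness, here is a hand-written sketch of what an MGO-style rewrite of the foo() example above could look like. It is not the RFC's actual output: the name foo_mgo, the fallback path on allocation failure, and the choice of caching all three loaded values are illustrative assumptions, and the sketch relies on the RFC's premise that the loaded data does not change across outer-loop iterations.

#include <stdlib.h>
#include <stdbool.h>

struct A { int **p1; };

/* One cache slot per inner-loop index j, mirroring the RFC's cache_elem.  */
typedef struct cache_elem {
  bool  init;
  int **c_v1;   /* cached array[j].p1   */
  int  *c_v2;   /* cached *array[j].p1  */
  int   c_v3;   /* cached **array[j].p1 */
} cache_elem;

int foo_mgo (int n, struct A *array)
{
  int sum = 0;
  /* The inner-loop index j never exceeds n - 2, so n slots are enough.
     calloc zero-initializes, so every init flag starts out false.  */
  cache_elem *cache_arr = calloc (n > 0 ? n : 1, sizeof (cache_elem));

  if (!cache_arr) {
    /* Allocation failed: fall back to the original scattered loads.  */
    for (int i = 0; i < n; i++)
      for (int j = 0; j < i; j++)
        sum += **array[j].p1;
    return sum;
  }

  for (int i = 0; i < n; i++) {
    size_t cache_idx = 0;
    for (int j = 0; j < i; j++) {
      cache_elem *ce = &cache_arr[cache_idx];
      if (!ce->init) {
        /* First touch of slot j: do the three scattered loads once and
           gather the results into the sequential cache array.  */
        ce->c_v1 = array[j].p1;
        ce->c_v2 = *ce->c_v1;
        ce->c_v3 = *ce->c_v2;
        ce->init = true;
      }
      /* Later outer iterations read only this one sequential stream.  */
      sum += ce->c_v3;
      cache_idx++;
    }
  }

  free (cache_arr);
  return sum;
}

From the second outer iteration onward, the only memory touched in the inner loop is cache_arr itself, which is exactly the single sequential stream the RFC aims for.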
Building gcc for C and C++ with a custom glibc
Hi all. I've been trying to build a custom gcc (trunk) with a custom glibc (trunk) with support for C and C++ on x86_64 Linux and have so far been unsuccessful at identifying a sequence of configure/make invocations that completes successfully. I'm not trying to build a cross compiler.

The scenario I'm trying to satisfy is testing some changes to gcc, and additional changes to libstdc++ that have new autoconf and header dependencies on the presence of new functions in existing glibc headers.

The glibc installation I'm trying to use was built with:

  mkdir glibc-build
  cd glibc-build
  ../glibc/configure \
    CC=gcc \
    CXX=g++ \
    --prefix /.../glibc
  make && make install

For gcc, I've tried numerous variants of the following:

  mkdir gcc-build
  cd gcc-build
  ../gcc/configure \
    CC=gcc \
    CXX=g++ \
    --prefix /.../gcc \
    --disable-multilib \
    --enable-languages=c,c++ \
    --enable-checking=release \
    --disable-bootstrap \
    --disable-lto

Things I've tried include setting CFLAGS, CXXFLAGS, and LDFLAGS to specify additional glibc paths, specifying alternate paths (via -nostdinc/-nostdinc++), setting LIBRARY_PATH and CPATH, passing --with-sysroot, passing --with-headers and --with-libs, passing --with-native-system-header-dir, some of those in conjunction with removing --disable-bootstrap, and wrapping gcc in a script that attempts to substitute certain include paths.

The problem I'm having is that, in every attempt, the glibc headers and libraries from under /usr end up getting used instead of the custom glibc ones at some point, leading to build failures.

Does anyone have a recipe available for doing this?

Tom.
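One low-level sanity check, not from the thread itself, is to point a single test compile explicitly at the prefix-installed glibc and confirm with ldd that nothing resolves under /usr. The file name conftest.c is a placeholder and the paths reuse the elided /.../glibc prefix from the message above:

  gcc -o conftest conftest.c \
      -I /.../glibc/include \
      -L /.../glibc/lib \
      -Wl,-rpath=/.../glibc/lib \
      -Wl,--dynamic-linker=/.../glibc/lib/ld-linux-x86-64.so.2
  ldd ./conftest   # libc.so.6 should resolve under /.../glibc/lib, not /usr

If ldd still reports /usr paths here, a full gcc build with the same flags will pick them up too, so this is a quick way to vet a candidate set of flags before a long bootstrap.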
Re: Building gcc for C and C++ with a custom glibc
On Sun, Jan 17, 2021 at 1:06 PM Tom Honermann via Gcc wrote:
>
> Hi all. I've been trying to build a custom gcc (trunk) with a custom glibc (trunk) with support for C and C++ on x86_64 Linux and have so far been unsuccessful at identifying a sequence of configure/make invocations that completes successfully. I'm not trying to build a cross compiler.
>
> The scenario I'm trying to satisfy is testing some changes to gcc, and additional changes to libstdc++ that have new autoconf and header dependencies on the presence of new functions in existing glibc headers.
>
> The glibc installation I'm trying to use was built with:
>
>   mkdir glibc-build
>   cd glibc-build
>   ../glibc/configure \
>     CC=gcc \
>     CXX=g++ \
>     --prefix /.../glibc
>   make && make install
>
> For gcc, I've tried numerous variants of the following:
>
>   mkdir gcc-build
>   cd gcc-build
>   ../gcc/configure \
>     CC=gcc \
>     CXX=g++ \
>     --prefix /.../gcc \
>     --disable-multilib \
>     --enable-languages=c,c++ \
>     --enable-checking=release \
>     --disable-bootstrap \
>     --disable-lto
>
> Things I've tried include setting CFLAGS, CXXFLAGS, and LDFLAGS to specify additional glibc paths, specifying alternate paths (via -nostdinc/-nostdinc++), setting LIBRARY_PATH and CPATH, passing --with-sysroot, passing --with-headers and --with-libs, passing --with-native-system-header-dir, some of those in conjunction with removing --disable-bootstrap, and wrapping gcc in a script that attempts to substitute certain include paths.
>
> The problem I'm having is that, in every attempt, the glibc headers and libraries from under /usr end up getting used instead of the custom glibc ones at some point, leading to build failures.
>
> Does anyone have a recipe available for doing this?

Try scripts/build-many-glibcs.py in glibc source.

--
H.J.
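For readers who have not used the suggested script before, a typical sequence of invocations, run from next to a glibc checkout, looks roughly like the following. The /some/workspace directory and the single x86_64-linux-gnu configuration are placeholder assumptions; the script checks out and builds its own copies of binutils, gcc, glibc and the other prerequisites under that directory:

  # fetch sources for all components into the workspace
  glibc/scripts/build-many-glibcs.py /some/workspace checkout
  # build host libraries (gmp, mpfr, mpc)
  glibc/scripts/build-many-glibcs.py /some/workspace host-libraries
  # build binutils and gcc for the chosen configuration
  glibc/scripts/build-many-glibcs.py /some/workspace compilers x86_64-linux-gnu
  # build glibc with the compiler from the previous step
  glibc/scripts/build-many-glibcs.py /some/workspace glibcs x86_64-linux-gnu

Local gcc or glibc patches can then be applied to the source trees the checkout step created under the workspace before rerunning the compilers or glibcs step.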
gcc-11-20210117 is now available
Snapshot gcc-11-20210117 is now available on

  https://gcc.gnu.org/pub/gcc/snapshots/11-20210117/

and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 11 git branch with the following options:

  git://gcc.gnu.org/git/gcc.git branch master revision 0f4c8f517b7954e113afb4d5c7212123c8ee2418

You'll find:

  gcc-11-20210117.tar.xz   Complete GCC
    SHA256=e52e24bd7a6950a795e3262d185117bd6e7373079c7e9a0bd85a736a827e3654
    SHA1=81c1818b18cf066c3f0d2d610aacaa0dfc9e8f2b

Diffs from 11-20210110 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-11 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
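To check a downloaded tarball against the SHA256 sum above (assuming the file sits in the current directory), something like the following works:

  echo "e52e24bd7a6950a795e3262d185117bd6e7373079c7e9a0bd85a736a827e3654  gcc-11-20210117.tar.xz" \
    | sha256sum --check -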
Re: [RFC] A memory gathering optimization for loop
>> -----Original Message-----
>> From: Feng Xue OS
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS ; Hao Liu OS
>> Subject: [RFC] A memory gathering optimization for loop
>>
>> 1. Background
>>
>> In a loop, it is optimal if only one memory stream is activated, that is, all memory operations sequentially access one data region. But that is not always the case, such as when traversing a linked list or manipulating discrete arrays. In this scenario, the loop contains multiple scattered memory accesses, which could trigger inefficient multi-way hardware prefetching and thus make the cpu cache hit rate drop. The tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one related problem that we mean to target.
>>
>> 2. Memory gathering optimization (MGO)
>>
>> For nested loops, if scattered memory accesses inside the inner loop remain unchanged in each iteration of the outer loop, we can dynamically gather the result data into a sequential memory region when the first access in the inner loop takes place. This way hardware prefetching can be reduced, and the cpu cache hit rate can be improved. We name this optimization MGO (memory gathering optimization). An example is given below; it is a little different from pr98598 and could not be optimized by loop distribution.
>>
>> struct A { int **p1; };
>>
>> int foo (int n, struct A *array)
>> {
>>   int sum = 0;
>>
>>   for (int i = 0; i < n; i++) {
>>     for (int j = 0; j < i; j++) {   /* iteration count is i */
>>       int **p1 = array[j].p1;
>>       int *p2 = *p1;
>>
>>       sum += *p2;
>>     }
>>   }
>>
>>   return sum;
>> }
>>
>
> The number of memory streams in this loop is limited (practically 1); how would the transform be affected when issuing prefetches for array at a well placed spot?
>
> That is, what's the hazard we avoid (besides the obvious data dependence shortening in case the load from array is in cache)?
>
> Richard.
>

In the example, to evaluate "sum += *p2" at each iteration, three memory loads are involved, and each of them forms a memory stream (so at least 3 in total, not 1), since they touch three discrete regions. Spatial locality only exists for the first load "array[j].p1", so some kind of prefetch, either via hardware or software, works well for it; for the other two, prefetch would be of no use, and the cpu cache could not speculate where those loads will go. So the intention of mgo is to put those discontinuous and unpredictable memory accesses together to form one new sequential memory stream in favor of the cpu cache. And if we statically find a hazard due to a write dependence, mgo will not be applied.

Thanks,
Feng

>> Although this example seems to be pretty artificial, this kind of data reuse in nested loops is often seen in real applications, especially those written in Fortran or using the C++ STL.
>>
>> To simplify the explanation of this memory gathering optimization, pseudo code for the original nested loops is given as:
>>
>> outer-loop ( ) {   /* data in inner loop are also invariant in outer loop. */
>>   ...
>>   inner-loop (iter, iter_count) {   /* inner loop to apply MGO */
>>
>>     Type1 v1 = LOAD_FN1 (iter);
>>     Type2 v2 = LOAD_FN2 (v1);
>>     Type3 v3 = LOAD_FN3 (v2);
>>     ...
>>     iter = NEXT_FN (iter);
>>   }
>>   ...
>> }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a result, and is composed of a sequence of operations in which there is only one memory load. It is capable of representing a simple memory dereference like "iter->field", or a more complicated expression like "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an iterator is advanced. It is able to describe a simple integer iterator, a list iterator, or even a pure-function-based iterator over advanced structures like a hashset/tree.
>>
>> Then the code will be transformed to:
>>
>> typedef struct cache_elem {
>>   bool  init;
>>   Type1 c_v1;
>>   Type2 c_v2;
>>   Type3 c_v3;
>> } cache_elem;
>>
>> cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>> outer-loop () {
>>   ...
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count) {
>>
>>     if (!cache_arr[cache_idx]->init) {
>>       v1 = LOAD_FN1 (iter);
>>       v2 = LOAD_FN2 (v1);
>>       v3 = LOAD_FN3 (v2);
>>
>>       cache_arr[cache_idx]->init = true;
>>       cache_arr[cache_idx]->c_v1 = v1;
>>       cache_arr[cache_idx]->c_v2 = v2;
>>       cache_arr[cache_idx]->c_v3 = v3;
>>     }
>>     else {
>>       v1 = cache_arr[cache_idx]->c_v1;
>>       v2 = cache_arr[cache_idx]->c_v2;
>>       v3