RE: [RFC] A memory gathering optimization for loop
On January 17, 2021 2:23:55 AM GMT+01:00, JiangNing OS via Gcc wrote:
>> -----Original Message-----
>> From: Feng Xue OS
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS ; Hao Liu OS
>> Subject: [RFC] A memory gathering optimization for loop
>>
>> 1. Background
>>
>> In a loop, it is optimal if only one memory stream is activated, that is, all memory operations sequentially access one data region. But that is not always the case, such as when traversing a linked list or manipulating discrete arrays. In this scenario, the loop contains multiple scattered memory accesses, which could trigger inefficient multi-way hardware prefetching and thus make the cpu cache hit rate drop. The tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one related problem that we mean to target.
>>
>> 2. Memory gathering optimization (MGO)
>>
>> For nested loops, if scattered memory accesses inside the inner loop remain unchanged in each iteration of the outer loop, we can dynamically gather the result data into a sequential memory region when the first access in the inner loop takes place. This way hardware prefetching can be reduced, and the cpu cache hit rate can be improved. We name this optimization MGO (memory gathering optimization). An example is given below; it is a little different from pr98598 and could not be optimized by loop distribution.
>>
>> struct A { int **p1; };
>>
>> int foo (int n, struct A *array)
>> {
>>   int sum = 0;
>>
>>   for (int i = 0; i < n; i++) {
>>     for (int j = 0; j < i; j++) {   /* iteration count is i */
>>       int **p1 = array[j].p1;
>>       int *p2 = *p1;
>>
>>       sum += *p2;
>>     }
>>   }
>>
>>   return sum;
>> }

The number of memory streams in this loop is limited (practically 1); how would the transform be affected when issuing prefetches for array at a well placed spot?

That is, what's the hazard we avoid (besides the obvious data dependence shortening in case the load from array is in cache)?

Richard.

>> Although this example seems to be pretty artificial, this kind of data reuse in nested loops is often seen in real applications, especially those written in Fortran or using the C++ STL.
>>
>> To simplify the explanation of this memory gathering optimization, pseudo code for the original nested loops is given as:
>>
>> outer-loop ( ) {   /* data in inner loop are also invariant in outer loop. */
>>   ...
>>   inner-loop (iter, iter_count) {   /* inner loop to apply MGO */
>>
>>     Type1 v1 = LOAD_FN1 (iter);
>>     Type2 v2 = LOAD_FN2 (v1);
>>     Type3 v3 = LOAD_FN3 (v2);
>>     ...
>>     iter = NEXT_FN (iter);
>>   }
>>   ...
>> }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a result, and is composed of a sequence of operations in which there is only one memory load. It is capable of representing a simple memory dereference like "iter->field", or a more complicated expression like "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an iterator is advanced. It is able to describe a simple integer iterator, a list iterator, or even a pure-function-based iterator over advanced structures like a hashset/tree.
>>
>> Then the code will be transformed to:
>>
>> typedef struct cache_elem {
>>   bool  init;
>>   Type1 c_v1;
>>   Type2 c_v2;
>>   Type3 c_v3;
>> } cache_elem;
>>
>> cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>> outer-loop () {
>>   ...
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count) {
>>
>>     if (!cache_arr[cache_idx]->init) {
>>       v1 = LOAD_FN1 (iter);
>>       v2 = LOAD_FN2 (v1);
>>       v3 = LOAD_FN3 (v2);
>>
>>       cache_arr[cache_idx]->init = true;
>>       cache_arr[cache_idx]->c_v1 = v1;
>>       cache_arr[cache_idx]->c_v2 = v2;
>>       cache_arr[cache_idx]->c_v3 = v3;
>>     }
>>     else {
>>       v1 = cache_arr[cache_idx]->c_v1;
>>       v2 = cache_arr[cache_idx]->c_v2;
>>       v3 = cache_arr[cache_idx]->c_v3;
>>     }
>>     ...
>>     cache_idx++;
>>     iter = NEXT_FN (iter);
>>   }
>>   ...
>> }
>>
>> free (cache_arr);
>
> The example in PR#98598 preloads the data before the outer loop, while the transformation in this RFC lazily caches the data in the inner loop, which is to avoid introducing unnecessary data RW hazards.
>
>> In the transformation, cache space belongs to compiler-generated temporary storage, but it is from user heap. We find this trick is very uncommon in gcc. We'd like to make sure whether there are any special considerations or restrictions on
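For concreteness, here is a hand-written sketch of what an MGO-style rewrite of the foo() example above could look like. It is not the RFC's actual output: the name foo_mgo, the fallback path on allocation failure, and the choice of caching all three loaded values are illustrative assumptions, and the sketch relies on the RFC's premise that the loaded data does not change across outer-loop iterations.

#include <stdlib.h>
#include <stdbool.h>

struct A { int **p1; };

/* One cache slot per inner-loop index j, mirroring the RFC's cache_elem.  */
typedef struct cache_elem {
  bool  init;
  int **c_v1;   /* cached array[j].p1   */
  int  *c_v2;   /* cached *array[j].p1  */
  int   c_v3;   /* cached **array[j].p1 */
} cache_elem;

int foo_mgo (int n, struct A *array)
{
  int sum = 0;
  /* The inner-loop index j never exceeds n - 2, so n slots are enough.
     calloc zero-initializes, so every init flag starts out false.  */
  cache_elem *cache_arr = calloc (n > 0 ? n : 1, sizeof (cache_elem));

  if (!cache_arr) {
    /* Allocation failed: fall back to the original scattered loads.  */
    for (int i = 0; i < n; i++)
      for (int j = 0; j < i; j++)
        sum += **array[j].p1;
    return sum;
  }

  for (int i = 0; i < n; i++) {
    size_t cache_idx = 0;
    for (int j = 0; j < i; j++) {
      cache_elem *ce = &cache_arr[cache_idx];
      if (!ce->init) {
        /* First touch of slot j: do the three scattered loads once and
           gather the results into the sequential cache array.  */
        ce->c_v1 = array[j].p1;
        ce->c_v2 = *ce->c_v1;
        ce->c_v3 = *ce->c_v2;
        ce->init = true;
      }
      /* Later outer iterations read only this one sequential stream.  */
      sum += ce->c_v3;
      cache_idx++;
    }
  }

  free (cache_arr);
  return sum;
}

From the second outer iteration onward, the only memory touched in the inner loop is cache_arr itself, which is exactly the single sequential stream the RFC aims for.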
Building gcc for C and C++ with a custom glibc
Hi all. I've been trying to build a custom gcc (trunk) with a custom glibc (trunk) with support for C and C++ on x86_64 Linux and have so far been unsuccessful at identifying a sequence of configure/make invocations that completes successfully. I'm not trying to build a cross compiler.

The scenario I'm trying to satisfy is testing some changes to gcc, and additional changes to libstdc++ that have new autoconf and header dependencies on the presence of new functions in existing glibc headers.

The glibc installation I'm trying to use was built with:

  mkdir glibc-build
  cd glibc-build
  ../glibc/configure \
    CC=gcc \
    CXX=g++ \
    --prefix /.../glibc
  make && make install

For gcc, I've tried numerous variants of the following:

  mkdir gcc-build
  cd gcc-build
  ../gcc/configure \
    CC=gcc \
    CXX=g++ \
    --prefix /.../gcc \
    --disable-multilib \
    --enable-languages=c,c++ \
    --enable-checking=release \
    --disable-bootstrap \
    --disable-lto

Things I've tried include setting CFLAGS, CXXFLAGS, and LDFLAGS to specify additional glibc paths, specifying alternate paths (via -nostdinc/-nostdinc++), setting LIBRARY_PATH and CPATH, passing --with-sysroot, passing --with-headers and --with-libs, passing --with-native-system-header-dir, some of those in conjunction with removing --disable-bootstrap, and wrapping gcc in a script that attempts to substitute certain include paths.

The problem I'm having is that, in every attempt, the glibc headers and libraries from under /usr end up getting used instead of the custom glibc ones at some point, leading to build failures.

Does anyone have a recipe available for doing this?

Tom.
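One low-level sanity check, not from the thread itself, is to point a single test compile explicitly at the prefix-installed glibc and confirm with ldd that nothing resolves under /usr. The file name conftest.c is a placeholder and the paths reuse the elided /.../glibc prefix from the message above:

  gcc -o conftest conftest.c \
      -I /.../glibc/include \
      -L /.../glibc/lib \
      -Wl,-rpath=/.../glibc/lib \
      -Wl,--dynamic-linker=/.../glibc/lib/ld-linux-x86-64.so.2
  ldd ./conftest   # libc.so.6 should resolve under /.../glibc/lib, not /usr

If ldd still reports /usr paths here, a full gcc build with the same flags will pick them up too, so this is a quick way to vet a candidate set of flags before a long bootstrap.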
Re: Building gcc for C and C++ with a custom glibc
On Sun, Jan 17, 2021 at 1:06 PM Tom Honermann via Gcc wrote:
>
> Hi all. I've been trying to build a custom gcc (trunk) with a custom glibc (trunk) with support for C and C++ on x86_64 Linux and have so far been unsuccessful at identifying a sequence of configure/make invocations that completes successfully. I'm not trying to build a cross compiler.
>
> The scenario I'm trying to satisfy is testing some changes to gcc, and additional changes to libstdc++ that have new autoconf and header dependencies on the presence of new functions in existing glibc headers.
>
> The glibc installation I'm trying to use was built with:
>
>   mkdir glibc-build
>   cd glibc-build
>   ../glibc/configure \
>     CC=gcc \
>     CXX=g++ \
>     --prefix /.../glibc
>   make && make install
>
> For gcc, I've tried numerous variants of the following:
>
>   mkdir gcc-build
>   cd gcc-build
>   ../gcc/configure \
>     CC=gcc \
>     CXX=g++ \
>     --prefix /.../gcc \
>     --disable-multilib \
>     --enable-languages=c,c++ \
>     --enable-checking=release \
>     --disable-bootstrap \
>     --disable-lto
>
> Things I've tried include setting CFLAGS, CXXFLAGS, and LDFLAGS to specify additional glibc paths, specifying alternate paths (via -nostdinc/-nostdinc++), setting LIBRARY_PATH and CPATH, passing --with-sysroot, passing --with-headers and --with-libs, passing --with-native-system-header-dir, some of those in conjunction with removing --disable-bootstrap, and wrapping gcc in a script that attempts to substitute certain include paths.
>
> The problem I'm having is that, in every attempt, the glibc headers and libraries from under /usr end up getting used instead of the custom glibc ones at some point, leading to build failures.
>
> Does anyone have a recipe available for doing this?

Try scripts/build-many-glibcs.py in glibc source.

--
H.J.
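For readers who have not used the suggested script before, a typical sequence of invocations, run from next to a glibc checkout, looks roughly like the following. The /some/workspace directory and the single x86_64-linux-gnu configuration are placeholder assumptions; the script checks out and builds its own copies of binutils, gcc, glibc and the other prerequisites under that directory:

  # fetch sources for all components into the workspace
  glibc/scripts/build-many-glibcs.py /some/workspace checkout
  # build host libraries (gmp, mpfr, mpc)
  glibc/scripts/build-many-glibcs.py /some/workspace host-libraries
  # build binutils and gcc for the chosen configuration
  glibc/scripts/build-many-glibcs.py /some/workspace compilers x86_64-linux-gnu
  # build glibc with the compiler from the previous step
  glibc/scripts/build-many-glibcs.py /some/workspace glibcs x86_64-linux-gnu

Local gcc or glibc patches can then be applied to the source trees the checkout step created under the workspace before rerunning the compilers or glibcs step.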
gcc-11-20210117 is now available
Snapshot gcc-11-20210117 is now available on

  https://gcc.gnu.org/pub/gcc/snapshots/11-20210117/

and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 11 git branch with the following options:

  git://gcc.gnu.org/git/gcc.git branch master revision 0f4c8f517b7954e113afb4d5c7212123c8ee2418

You'll find:

  gcc-11-20210117.tar.xz   Complete GCC
    SHA256=e52e24bd7a6950a795e3262d185117bd6e7373079c7e9a0bd85a736a827e3654
    SHA1=81c1818b18cf066c3f0d2d610aacaa0dfc9e8f2b

Diffs from 11-20210110 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-11 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
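To check a downloaded tarball against the SHA256 sum above (assuming the file sits in the current directory), something like the following works:

  echo "e52e24bd7a6950a795e3262d185117bd6e7373079c7e9a0bd85a736a827e3654  gcc-11-20210117.tar.xz" \
    | sha256sum --check -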
Re: [RFC] A memory gathering optimization for loop
>> -----Original Message-----
>> From: Feng Xue OS
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS ; Hao Liu OS
>> Subject: [RFC] A memory gathering optimization for loop
>>
>> 1. Background
>>
>> In a loop, it is optimal if only one memory stream is activated, that is, all memory operations sequentially access one data region. But that is not always the case, such as when traversing a linked list or manipulating discrete arrays. In this scenario, the loop contains multiple scattered memory accesses, which could trigger inefficient multi-way hardware prefetching and thus make the cpu cache hit rate drop. The tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one related problem that we mean to target.
>>
>> 2. Memory gathering optimization (MGO)
>>
>> For nested loops, if scattered memory accesses inside the inner loop remain unchanged in each iteration of the outer loop, we can dynamically gather the result data into a sequential memory region when the first access in the inner loop takes place. This way hardware prefetching can be reduced, and the cpu cache hit rate can be improved. We name this optimization MGO (memory gathering optimization). An example is given below; it is a little different from pr98598 and could not be optimized by loop distribution.
>>
>> struct A { int **p1; };
>>
>> int foo (int n, struct A *array)
>> {
>>   int sum = 0;
>>
>>   for (int i = 0; i < n; i++) {
>>     for (int j = 0; j < i; j++) {   /* iteration count is i */
>>       int **p1 = array[j].p1;
>>       int *p2 = *p1;
>>
>>       sum += *p2;
>>     }
>>   }
>>
>>   return sum;
>> }
>>
>
> The number of memory streams in this loop is limited (practically 1); how would the transform be affected when issuing prefetches for array at a well placed spot?
>
> That is, what's the hazard we avoid (besides the obvious data dependence shortening in case the load from array is in cache)?
>
> Richard.
>

In the example, to evaluate "sum += *p2" at each iteration, three memory loads are involved, and each of them forms a memory stream (so at least 3 in total, not 1), since they touch three discrete regions. Spatial locality only exists for the first load "array[j].p1", so some kind of prefetch, either via hardware or software, works well for it; for the other two, prefetch would be of no use, and the cpu cache could not speculate where those loads will go. So the intention of mgo is to put those discontinuous and unpredictable memory accesses together to form one new sequential memory stream in favor of the cpu cache. And if we statically find a hazard due to a write dependence, mgo will not be applied.

Thanks,
Feng

>> Although this example seems to be pretty artificial, this kind of data reuse in nested loops is often seen in real applications, especially those written in Fortran or using the C++ STL.
>>
>> To simplify the explanation of this memory gathering optimization, pseudo code for the original nested loops is given as:
>>
>> outer-loop ( ) {   /* data in inner loop are also invariant in outer loop. */
>>   ...
>>   inner-loop (iter, iter_count) {   /* inner loop to apply MGO */
>>
>>     Type1 v1 = LOAD_FN1 (iter);
>>     Type2 v2 = LOAD_FN2 (v1);
>>     Type3 v3 = LOAD_FN3 (v2);
>>     ...
>>     iter = NEXT_FN (iter);
>>   }
>>   ...
>> }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a result, and is composed of a sequence of operations in which there is only one memory load. It is capable of representing a simple memory dereference like "iter->field", or a more complicated expression like "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an iterator is advanced. It is able to describe a simple integer iterator, a list iterator, or even a pure-function-based iterator over advanced structures like a hashset/tree.
>>
>> Then the code will be transformed to:
>>
>> typedef struct cache_elem {
>>   bool  init;
>>   Type1 c_v1;
>>   Type2 c_v2;
>>   Type3 c_v3;
>> } cache_elem;
>>
>> cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>> outer-loop () {
>>   ...
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count) {
>>
>>     if (!cache_arr[cache_idx]->init) {
>>       v1 = LOAD_FN1 (iter);
>>       v2 = LOAD_FN2 (v1);
>>       v3 = LOAD_FN3 (v2);
>>
>>       cache_arr[cache_idx]->init = true;
>>       cache_arr[cache_idx]->c_v1 = v1;
>>       cache_arr[cache_idx]->c_v2 = v2;
>>       cache_arr[cache_idx]->c_v3 = v3;
>>     }
>>     else {
>>       v1 = cache_arr[cache_idx]->c_v1;
>>       v2 = cache_arr[cache_idx]->c_v2;
>>       v3