Re: [PATCH 0/7] openmp: OpenMP 5.1 loop transformation directives
On Wed, May 17, 2023 at 01:55:00PM +0200, Frederik Harwath wrote: > Thanks for the explanation. But actually doing this would require a > complete rewrite which would almost certainly imply that mainline GCC > would not support the loop transformations for a long time. I don't think it needs complete rewrite, the change to use OMP_UNROLL/OMP_TILE should actually simplify stuff when you already have some other extra construct to handle the clauses if it isn't nested into something else, so I wouldn't expect it needs more than 2-3 hours of work. It is true that doing the transformation on trees rather than high gimple is something different, but again it doesn't require everything to be rewritten and we have code to do code copying both on trees and high and low gimple in tree-inline.cc, so the unrolling can just use different APIs to perform it. I'd still prefer to do it like that, I think it will pay back in maintainance costs. If you don't get to this within say 2 weeks, I'll try to do the conversion myself. Jakub
Advice with finding speed between O2 and O3
All, Recently, one of the computing centers I run on updated their OS. And in that update, the model went from "working with GNU" to "crashing with GNU". No code change on our side, just OS. Some experimenting later and I found that the code did run with debugging options, and it still ran with our "aggressive" options (much of which is due to Jerry DeLisle from here). Only our release flags failed. Surprising since the Aggressive options seem more likely to have issues as they are speed for speed's sake (different MPI layouts lead to different answers). But, one of the main differences are the aggressive flags use -O2 and our release flags are -O3. So I test our release flags with -O2 and boom, works again! Bad news: much slower. Our release flags are (essentially): -O3 -march=haswell -mtune=generic -funroll-loops -g -fPIC -fopenmp so we aren't doing anything fancy (portability at the cost of speed). Staring at the man page I saw this: gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts diff /tmp/O2-opts /tmp/O3-opts | grep enabled and when I did that I saw: $ diff /tmp/O2-opts /tmp/O3-opts | grep enabled > -fgcse-after-reload [enabled] > -fipa-cp-clone[enabled] > -floop-interchange[enabled] > -floop-unroll-and-jam [enabled] > -fpeel-loops [enabled] > -fpredictive-commoning[enabled] > -fsplit-loops [enabled] > -fsplit-paths [enabled] > -ftree-loop-distribution [enabled] > -ftree-partial-pre[enabled] > -funroll-completely-grow-size [enabled] > -funswitch-loops [enabled] > -fversion-loops-for-strides [enabled] Now, I'll be doing some experiments, but...that's a lot of tests and rebuilds. I was hoping maybe someone here can point me to "this flag is useful for Fortran" vs "this doesn't matter". And maybe which one might be triggered by an OS update? ¯\_(ツ)_/¯ Thanks, Matt -- Matt Thompson, SSAI, Ld Scientific Programmer/Analyst NASA GSFC,Global Modeling and Assimilation Office Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771 Phone: 301-614-6712 Fax: 301-614-6246 http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
Re: Advice with finding speed between O2 and O3
Hi Matt, Recently, one of the computing centers I run on updated their > OS. And in that update, the model went from "working with GNU" to "crashing with GNU". No code change on our side, just OS. That sounds suspicious, and points to possible bugs in the code. Hmm... does the upgrade mean another compiler version? That could break things, one way or another. Which version were you using on the old system, and which one are you using now? Does code compiled on the old system still work? In your case, I would try out whatever debugging options you have at your disposal, to find the culprit(s). Use -fcheck=all. Link with -static-libgfortran to make sure the right library is used. Use -fsanitize=undefined and -fsanitize=address. Run your code under valgrind. Use another compiler (nagfor is excellent at finding bugs with its catch-all debug option). Use -finit-real=NAN. Use -Wall -Werror and look at the warnings. Use LTO to find mismatches in code, or concatenate the whole source into one file and compile it (never versions of gfortran will then issue errors on suspect code). Some experimenting later and I found that the code did run with debugging > options, and it still ran with our "aggressive" options (much of which> is due to Jerry DeLisle from here). Only our release flags failed.> Surprising since the Aggressive options seem more likely to have issues> as they are speed for speed's sake (different MPI layouts lead to different> answers). I've never used MPI, but what you describe also sounds suspicious; maybe some sort of race condition in the code? But, one of the main differences are the aggressive flags use -O2 > and our release flags are -O3. So I test our release flags with> -O2 and boom, works again! Bad news: much slower. Our release flags are (essentially): -O3 -march=haswell -mtune=generic -funroll-loops -g -fPIC -fopenmp so we aren't doing anything fancy (portability at the cost of speed). Staring at the man page I saw this: gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts diff /tmp/O2-opts /tmp/O3-opts | grep enabled and when I did that I saw: $ diff /tmp/O2-opts /tmp/O3-opts | grep enabled -fgcse-after-reload [enabled] -fipa-cp-clone[enabled] -floop-interchange[enabled] -floop-unroll-and-jam [enabled] -fpeel-loops [enabled] -fpredictive-commoning[enabled] -fsplit-loops [enabled] -fsplit-paths [enabled] -ftree-loop-distribution [enabled] -ftree-partial-pre[enabled] -funroll-completely-grow-size [enabled] -funswitch-loops [enabled] -fversion-loops-for-strides [enabled] Now, I'll be doing some experiments, but...that's a lot > of tests and rebuilds. I was hoping maybe someone here> can point me to "this flag is useful for Fortran" I think -floop-interchange has little effect on Fortran, there is a PR on it, IIRC. If you want to test, a binary search could help. vs "this doesn't matter". And maybe which one might be triggered by an OS update? ¯\_(ツ)_/¯ Was the compiler also upgraded, or was it just the kernel? Like I wrote above, different compiler versions may well cause problems, which is why there is a porting_to.html file for every gcc release. The newest one can be found here: https://gcc.gnu.org/gcc-13/porting_to.html Best regards Thomas
Re: [PATCH v4] libgfortran: Replace mutex with rwlock
On 5/16/2023 3:08 PM, Zhu, Lipeng wrote: On 5/9/2023 10:32 AM, Zhu, Lipeng wrote: On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote: On Mon, 8 May 2023 17:44:43 +0800 Lipeng Zhu wrote: This patch try to introduce the rwlock and split the read/write to unit_root tree and unit_cache with rwlock instead of the mutex to increase CPU efficiency. In the get_gfc_unit function, the percentage to step into the insert_unit function is around 30%, in most instances, we can get the unit in the phase of reading the unit_cache or unit_root tree. So split the read/write phase by rwlock would be an approach to make it more parallel. BTW, the IPC metrics can gain around 9x in our test server with 220 cores. The benchmark we used is https://github.com/rwesson/NEAT See commentary typos below. You did not state if you regression tested the patch? I use valgrind --tool=helgrind or --tool=drd to test 'make check-fortran'. Is it necessary to add an additional unit test for this patch? Other than that it LGTM but i cannot approve it. Thank you for your kind help for this patch, is there anything that I can do or can you help to push this patch forward? Hi Bernhard, Is there any other refinement that need I to do for this patch? Thanks. May I know any comment or concern on this patch, thanks for your time :) diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index ad226c8e856..0033cc74252 100644 --- a/libgfortran/io/async.h +++ b/libgfortran/io/async.h @@ -210,6 +210,128 @@ DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \ Thanks, corrected in Patch v5. } while (0) +#ifdef __GTHREAD_RWLOCK_INIT +#define RWLOCK_DEBUG_ADD(rwlock) do { \ + aio_rwlock_debug *n; \ + n = xmalloc (sizeof(aio_rwlock_debug)); \ Missing space before the open brace: sizeof ( Thanks, corrected in Patch v5. diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index 82664dc5f98..62f1db21d34 100644 --- a/libgfortran/io/unit.c +++ b/libgfortran/io/unit.c @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see /* IO locking rules: - UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE. + UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE. + And use the rwlock to spilt read and write phase to UNIT_ROOT tree + and UNIT_CACHE to increase CPU efficiency. s/spilt/split. Maybe: Using an rwlock improves efficiency by allowing us to separate readers and writers of both UNIT_ROOT and UNIT_CACHE. Thanks, corrected in Patch v5. @@ -350,6 +356,17 @@ retry: if (c == 0) break; } + /* We did not find a unit in the cache nor in the unit list, create a new + (locked) unit and insert into the unit list and cache. + Manipulating either or both the unit list and the unit cache requires to + hold a write-lock [for obvious reasons]: + 1. By separating the read/write lock, it will greatly reduce the contention + at the read part, while write part is not always necessary or most + unlikely once the unit hit in cache. + By separating the read/write lock, we will greatly reduce the contention + on the read part, while the write part is unlikely once the unit hits + the cache. + 2. We try to balance the implementation complexity and the performance + gains that fit into current cases we observed by just using a + pthread_rwlock. */ Let's drop 2. Got it, thanks! thanks,