Re: [PATCH 0/7] openmp: OpenMP 5.1 loop transformation directives

2023-05-22 Thread Jakub Jelinek via Fortran
On Wed, May 17, 2023 at 01:55:00PM +0200, Frederik Harwath wrote:
> Thanks for the explanation. But actually doing this would require a
> complete rewrite which would almost certainly imply that mainline GCC
> would not support the loop transformations for a long time.

I don't think it needs complete rewrite, the change to use
OMP_UNROLL/OMP_TILE should actually simplify stuff when you already have
some other extra construct to handle the clauses if it isn't nested into
something else, so I wouldn't expect it needs more than 2-3 hours of work.
It is true that doing the transformation on trees rather than high gimple
is something different, but again it doesn't require everything to be
rewritten and we have code to do code copying both on trees and high and low
gimple in tree-inline.cc, so the unrolling can just use different APIs
to perform it.

I'd still prefer to do it like that, I think it will pay back in
maintainance costs.

If you don't get to this within say 2 weeks, I'll try to do the conversion
myself.

Jakub



Advice with finding speed between O2 and O3

2023-05-22 Thread Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] via Fortran
All,

Recently, one of the computing centers I run on updated their OS. And in that 
update, the model went from "working with GNU" to "crashing with GNU". No code 
change on our side, just OS.

Some experimenting later and I found that the code did run with debugging 
options, and it still ran with our "aggressive" options (much of which is due 
to Jerry DeLisle from here). Only our release flags failed. Surprising since 
the Aggressive options seem more likely to have issues as they are speed for 
speed's sake (different MPI layouts lead to different answers).

But, one of the main differences are the aggressive flags use -O2 and our 
release flags are -O3. So I test our release flags with -O2 and boom, works 
again! Bad news: much slower.

Our release flags are (essentially):

  -O3 -march=haswell -mtune=generic -funroll-loops -g -fPIC -fopenmp

so we aren't doing anything fancy (portability at the cost of speed).

Staring at the man page I saw this:

   gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
   gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
   diff /tmp/O2-opts /tmp/O3-opts | grep enabled

and when I did that I saw:

$ diff /tmp/O2-opts /tmp/O3-opts | grep enabled
>   -fgcse-after-reload   [enabled]
>   -fipa-cp-clone[enabled]
>   -floop-interchange[enabled]
>   -floop-unroll-and-jam [enabled]
>   -fpeel-loops  [enabled]
>   -fpredictive-commoning[enabled]
>   -fsplit-loops [enabled]
>   -fsplit-paths [enabled]
>   -ftree-loop-distribution  [enabled]
>   -ftree-partial-pre[enabled]
>   -funroll-completely-grow-size [enabled]
>   -funswitch-loops  [enabled]
>   -fversion-loops-for-strides   [enabled]

Now, I'll be doing some experiments, but...that's a lot of tests and rebuilds. 
I was hoping maybe someone here can point me to "this flag is useful for 
Fortran" vs "this doesn't matter".

And maybe which one might be triggered by an OS update? ¯\_(ツ)_/¯

Thanks,
Matt
--
Matt Thompson, SSAI, Ld Scientific Programmer/Analyst
NASA GSFC,Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson


Re: Advice with finding speed between O2 and O3

2023-05-22 Thread Thomas Koenig via Fortran

Hi Matt,


Recently, one of the computing centers I run on updated their  > OS. And in that update, 
the model went from "working with GNU"
to "crashing with GNU". No code change on our side, just OS.


That sounds suspicious, and points to possible bugs in the
code.

Hmm... does the upgrade mean another compiler version?
That could break things, one way or another.  Which
version were you using on the old system, and which one are
you using now?

Does code compiled on the old system still work?

In your case, I would try out whatever debugging options you have
at your disposal, to find the culprit(s).  Use -fcheck=all.
Link with -static-libgfortran to make sure the right library
is used.  Use -fsanitize=undefined and -fsanitize=address. Run
your code under valgrind. Use another compiler (nagfor is excellent
at finding bugs with its catch-all debug option). Use -finit-real=NAN.
Use -Wall -Werror and look at the warnings. Use LTO to find mismatches
in code, or concatenate the whole source into one file and compile
it (never versions of gfortran will then issue errors on suspect code).





Some experimenting later and I found that the code did run with debugging > options, and it still ran with our "aggressive" options (much of 
which> is due to Jerry DeLisle from here). Only our release flags 
failed.> Surprising since the Aggressive options seem more likely to 
have issues> as they are speed for speed's sake (different MPI layouts 
lead to different> answers).


I've never used MPI, but what you describe also sounds suspicious;
maybe some sort of race condition in the code?

But, one of the main differences are the aggressive flags use -O2 > and our release flags are -O3. So I test our release flags with> -O2 

and boom, works again! Bad news: much slower.


Our release flags are (essentially):

   -O3 -march=haswell -mtune=generic -funroll-loops -g -fPIC -fopenmp

so we aren't doing anything fancy (portability at the cost of speed).

Staring at the man page I saw this:

gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled

and when I did that I saw:

$ diff /tmp/O2-opts /tmp/O3-opts | grep enabled

   -fgcse-after-reload   [enabled]
   -fipa-cp-clone[enabled]
   -floop-interchange[enabled]
   -floop-unroll-and-jam [enabled]
   -fpeel-loops  [enabled]
   -fpredictive-commoning[enabled]
   -fsplit-loops [enabled]
   -fsplit-paths [enabled]
   -ftree-loop-distribution  [enabled]
   -ftree-partial-pre[enabled]
   -funroll-completely-grow-size [enabled]
   -funswitch-loops  [enabled]
   -fversion-loops-for-strides   [enabled]


Now, I'll be doing some experiments, but...that's a lot > of tests and rebuilds. I was hoping maybe someone here> can point me 

to "this flag is useful for Fortran"

I think -floop-interchange has little effect on Fortran,
there is a PR on it, IIRC.

If you want to test, a binary search could help.


vs "this doesn't matter".

And maybe which one might be triggered by an OS update? ¯\_(ツ)_/¯


Was the compiler also upgraded, or was it just the kernel?
Like I wrote above, different compiler versions may well
cause problems, which is why there is a porting_to.html
file for every gcc release.  The newest one can be found
here: https://gcc.gnu.org/gcc-13/porting_to.html

Best regards

Thomas


Re: [PATCH v4] libgfortran: Replace mutex with rwlock

2023-05-22 Thread Zhu, Lipeng via Fortran




On 5/16/2023 3:08 PM, Zhu, Lipeng wrote:



On 5/9/2023 10:32 AM, Zhu, Lipeng wrote:



On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote:

On Mon,  8 May 2023 17:44:43 +0800
Lipeng Zhu  wrote:


This patch try to introduce the rwlock and split the read/write to
unit_root tree and unit_cache with rwlock instead of the mutex to
increase CPU efficiency. In the get_gfc_unit function, the percentage
to step into the insert_unit function is around 30%, in most
instances, we can get the unit in the phase of reading the unit_cache
or unit_root tree. So split the read/write phase by rwlock would be an
approach to make it more parallel.

BTW, the IPC metrics can gain around 9x in our test server with 220
cores. The benchmark we used is https://github.com/rwesson/NEAT


See commentary typos below.
You did not state if you regression tested the patch?
I use valgrind --tool=helgrind or --tool=drd to test 'make 
check-fortran'. Is it necessary to add an additional unit test for 
this patch?



Other than that it LGTM but i cannot approve it.
Thank you for your kind help for this patch, is there anything that I 
can do or can you help to push this patch forward?



Hi Bernhard,

Is there any other refinement that need I to do for this patch?

Thanks.




May I know any comment or concern on this patch, thanks for your time :)




diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index
ad226c8e856..0033cc74252 100644
--- a/libgfortran/io/async.h
+++ b/libgfortran/io/async.h
@@ -210,6 +210,128 @@
  DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s 
%78p\n", aio_prefix, #mutex,

mutex); \

Thanks, corrected in Patch v5.


    } while (0)
+#ifdef __GTHREAD_RWLOCK_INIT
+#define RWLOCK_DEBUG_ADD(rwlock) do {    \
+    aio_rwlock_debug *n;    \
+    n = xmalloc (sizeof(aio_rwlock_debug));    \


Missing space before the open brace: sizeof (


Thanks, corrected in Patch v5.


diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index
82664dc5f98..62f1db21d34 100644
--- a/libgfortran/io/unit.c
+++ b/libgfortran/io/unit.c
@@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME
respectively.  If not, see
  /* IO locking rules:
-   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and 
UNIT_CACHE.
+   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and 
UNIT_CACHE.

+   And use the rwlock to spilt read and write phase to UNIT_ROOT tree
+   and UNIT_CACHE to increase CPU efficiency.


s/spilt/split. Maybe:

Using an rwlock improves efficiency by allowing us to separate 
readers and writers of both UNIT_ROOT

and UNIT_CACHE.


Thanks, corrected in Patch v5.


@@ -350,6 +356,17 @@ retry:
    if (c == 0)
  break;
  }
+  /* We did not find a unit in the cache nor in the unit list, 
create a new

+    (locked) unit and insert into the unit list and cache.
+    Manipulating either or both the unit list and the unit cache 
requires to

+    hold a write-lock [for obvious reasons]:
+    1. By separating the read/write lock, it will greatly reduce 
the contention
+   at the read part, while write part is not always necessary 
or most

+   unlikely once the unit hit in cache.


+    By separating the read/write lock, we will greatly reduce the 
contention
+    on the read part, while the write part is unlikely once the unit 
hits

+    the cache.

+    2. We try to balance the implementation complexity and the 
performance

+   gains that fit into current cases we observed by just using a
+   pthread_rwlock. */


Let's drop 2.


Got it, thanks!

thanks,