Undefined behavior or gcc is doing additional good job?
Hi,

For the simple example below:

#include <stdint.h>

extern uint32_t __bss_start[];
extern uint32_t __data_start[];

void Reset_Handler(void)
{
  /* Clear .bss section (initialize with zeros) */
  for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) {
    *bss_ptr = 0;
  }
}

One snapshot of our branch generates:

Reset_Handler:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r2, .L6
        ldr     r1, .L6+4
        subs    r1, r1, r2
        bic     r1, r1, #3
        movs    r3, #0
.L2:
        cmp     r3, r1
        beq     .L5
        movs    r0, #0
        str     r0, [r2, r3]
        adds    r3, r3, #4
        b       .L2
.L5:
        bx      lr
.L7:
        .align  2
.L6:
        .word   __bss_start
        .word   __data_start
        .size   Reset_Handler, .-Reset_Handler

I know IVOPT chooses the wrong candidate here, but what I am not sure about is:

0) The original code is not safe.  It could result in an infinite loop if there is any alignment issue with __bss_start and __data_start.
1) GCC explicitly clears the two lower bits of (__data_start - __bss_start).  This makes the loop safe (from an infinite loop).

My question is: is it intended for GCC to do such a transformation?

Thanks,
bin

-- 
Best Regards.
Re: Undefined behavior or gcc is doing additional good job?
On Fri, Jan 03, 2014 at 04:12:19PM +0800, Bin.Cheng wrote:
> Hi, For below simple example:
> #include <stdint.h>
>
> extern uint32_t __bss_start[];
> extern uint32_t __data_start[];
>
> void Reset_Handler(void)
> {
>   /* Clear .bss section (initialize with zeros) */
>   for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) {
>     *bss_ptr = 0;
>   }
> }

I believe this is undefined behavior, so GCC can assume
bss_ptr != __data_start is always true.  You need something like

  memset (__bss_start, 0, (uintptr_t) __data_start - (uintptr_t) __bss_start);

(note the casts to non-pointers), then it is just implementation-defined
behavior.  Or do

  uint32_t *data_ptr;
  asm ("" : "=g" (data_ptr) : "0" (__data_start));
  for (uint32_t* bss_ptr = __bss_start; bss_ptr != data_ptr; ++bss_ptr) {
    *bss_ptr = 0;
  }

and thus hide from the compiler the fact that __data_start is in a different
object.

        Jakub
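A minimal sketch of the memset variant suggested above, for readers who want something to compile. The helper name clear_region and its two-pointer interface are hypothetical and not from the thread; in a real reset handler the arguments would be the linker symbols __bss_start and __data_start, and it is assumed here that memset is usable that early in startup.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the thread): zero the bytes in [start, end).
   The length is computed on uintptr_t values rather than by comparing or
   subtracting pointers, so the compiler has no pointer relation between
   two different objects to reason about.  The pointer-to-integer
   conversion itself is implementation-defined. */
static void clear_region(void *start, void *end)
{
    uintptr_t first = (uintptr_t) start;
    uintptr_t last = (uintptr_t) end;

    /* Do nothing if the region is empty or the symbols are reversed. */
    if (last > first)
        memset(start, 0, last - first);
}
```

With the declarations from the original example, the call would look like clear_region(__bss_start, __data_start).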
Re: Undefined behavior or gcc is doing additional good job?
On Fri, Jan 3, 2014 at 4:24 PM, Jakub Jelinek wrote:
> On Fri, Jan 03, 2014 at 04:12:19PM +0800, Bin.Cheng wrote:
>> Hi, For below simple example:
>> #include <stdint.h>
>>
>> extern uint32_t __bss_start[];
>> extern uint32_t __data_start[];
>>
>> void Reset_Handler(void)
>> {
>>   /* Clear .bss section (initialize with zeros) */
>>   for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) {
>>     *bss_ptr = 0;
>>   }
>> }
>
> I believe this is undefined behavior, so GCC can assume
> bss_ptr != __data_start is true always.  You need something like

Sorry for posting the premature question.  Since both __bss_start and
__data_start are declared as arrays, it seems there is no undefined
behavior; the check is actually between two pointers of the same type,
right?  So the question remains: why would GCC clear the two lower
bits of "__data_start - __bss_start" then?  Am I making some stupid mistake?

Thanks,
bin

> memset (__bss_start, 0, (uintptr_t) __data_start - (uintptr_t) __bss_start);
> (note the casts to non-pointers), then it is just implementation defined
> behavior.  Or do
> uint32_t *data_ptr;
> asm ("" : "=g" (data_ptr) : "0" (__data_start));
> for (uint32_t* bss_ptr = __bss_start; bss_ptr != data_ptr; ++bss_ptr) {
>   *bss_ptr = 0;
> }
> and thus hide from the compiler the fact that __data_start is in a different
> object.
>
>         Jakub

-- 
Best Regards.
Re: Undefined behavior or gcc is doing additional good job?
On Fri, Jan 03, 2014 at 04:44:48PM +0800, Bin.Cheng wrote:
> >> extern uint32_t __bss_start[];
> >> extern uint32_t __data_start[];
> >>
> >> void Reset_Handler(void)
> >> {
> >>   /* Clear .bss section (initialize with zeros) */
> >>   for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr)
> >>   {
> >>     *bss_ptr = 0;
> >>   }
> >> }
> >
> > I believe this is undefined behavior, so GCC can assume
> > bss_ptr != __data_start is true always.  You need something like
> Sorry for posting the premature question.  Since both __bss_start and
> __data_start are declared as array, it seems there is no undefined
> behavior, the check is between two pointers with same type actually,

I think this has been discussed in some PR; unfortunately I can't find it.
If it were < or <=, then it would be obvious undefined behavior, since those
comparisons can't be performed between different objects.  The above is
questionable, because you still assume that pointer arithmetic gets you
from one object to another.  Without a dereference, pointer arithmetic can
reach one past the last entry in the array, but whether that is equal to
the other object is still quite problematic.

> right?  So the question remains, why GCC would clear the two lower
> bits of " __data_start - __bss_start" then?  Am I some stupid mistake?

That said, if either __bss_start or __data_start isn't 32-bit aligned,
then it is clearly undefined behavior.  The masking of the low 2 bits
(which doesn't happen on x86_64) comes from IVopts computing the end as
((__data_start - __bss_start) + 1) * 4, where __data_start - __bss_start
is an exact division by 4; apparently we don't fold that back to just
(char *) __data_start - (char *) __bss_start + 4.

        Jakub
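The origin of the bic r1, r1, #3 can be illustrated on plain integers. This is only a sketch, assuming 4-byte elements and a non-negative byte distance; both function names are hypothetical, and the real computation happens inside IVopts on pointer types (where the division is assumed to be exact rather than truncating).

```c
#include <stdint.h>

/* Byte span as an element-based computation sees it: count whole
   4-byte elements between the two addresses, then scale back to bytes. */
static uintptr_t span_via_elements(uintptr_t start, uintptr_t end)
{
    uintptr_t elements = (end - start) / 4;
    return elements * 4;
}

/* The same value obtained by clearing the two low bits of the byte
   distance, which is what the bic instruction in the ARM output does. */
static uintptr_t span_via_mask(uintptr_t start, uintptr_t end)
{
    return (end - start) & ~(uintptr_t) 3;
}
```

For aligned addresses both functions return the plain byte distance; for misaligned ones they round it down to a multiple of 4, which is why the generated loop terminates even when the source-level loop would not.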
Why __builtin_sqrt do not totally replace sqrt in asm
Hi,

When the standard pattern 'sqrtm2' is defined, I don't understand why calls to sqrt or __builtin_sqrt are always followed by a comparison of the result with itself (checking for NaN?) and a conditional branch to the sqrt symbol (so linking with libm is always mandatory).

        mov $FP0,$FP1
        fsqrt $FP0, $FP0        << the builtin_sqrt
        fcompare $FP0,$FP0      << strange compare of the result of builtin_sqrt
        jmp.ifEQUAL .L2
        mov $FP1,$FP0
        branch sqrt             << branch to sqrt symbol if $FP0 != $FP0
.L2

Is there a way to tell GCC that the sqrt function is totally handled by __builtin_sqrt?

Regards,

Selim
Re: Why __builtin_sqrt do not totally replace sqrt in asm
On Fri, Jan 03, 2014 at 10:44:21AM +0100, BELBACHIR Selim wrote:
> When the standard pattern 'sqrtm2' is defined I don't understand why calls
> to sqrt or __builtin_sqrt is always followed by a comparison of the result
> with itself (checking the NaN ?) and a conditional branch to sqrt symbol
> (so linking with libm is always mandatory).

Because -fmath-errno is the default, and sqrt of a negative value (including
-Inf) is supposed to set errno.  Use -ffast-math or -fno-math-errno if you
don't need/want that.  On some targets GCC is able to emit code to set errno
directly; on others GCC just emits a call to the library function so that it
handles errno properly.

        Jakub
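The instruction sequence from the question can be written out in C to make the errno handling visible. This is a sketch only: hw_sqrt_stub and lib_sqrt_stub are hypothetical stand-ins (a crude Newton iteration, not meant for infinities or very large inputs) for the hardware fsqrt and the libm sqrt, so that the example does not depend on linking libm; sqrt_with_errno mirrors the compare-with-itself and branch.

```c
#include <errno.h>
#include <math.h>   /* NAN macro only; no libm function is called */

/* Hypothetical stand-in for the hardware fsqrt instruction: returns NaN
   for negative or NaN input and never touches errno. */
static double hw_sqrt_stub(double x)
{
    if (!(x >= 0.0))
        return NAN;
    if (x == 0.0)
        return 0.0;
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 64; i++)      /* Newton iteration for r*r == x */
        r = 0.5 * (r + x / r);
    return r;
}

/* Hypothetical stand-in for the library sqrt: same value, but it sets
   errno to EDOM for a negative argument. */
static double lib_sqrt_stub(double x)
{
    if (x < 0.0)
        errno = EDOM;
    return hw_sqrt_stub(x);
}

/* Shape of the code emitted under -fmath-errno: take the fast result
   unless it is NaN (r != r), in which case redo the call in the library
   so that errno gets set. */
double sqrt_with_errno(double x)
{
    double r = hw_sqrt_stub(x);
    if (r == r)
        return r;
    return lib_sqrt_stub(x);
}
```

With -fno-math-errno the compiler no longer has to preserve the errno side effect, so the comparison and the fallback call are not needed.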
lto testsuite may erase mathlib variable
Hi,

I noticed a problem in gcc/testsuite/g++.dg/lto/lto.exp.

If the target does not support LTO (check_effective_target_lto), the script returns abruptly, so the mathlib variable modified in lto_init is not restored by lto_finish at the end of the script.

Subsequent testsuites will then find an empty mathlib.

Regards,

Selim

[Attachment: patch]
Re: Undefined behavior or gcc is doing additional good job?
On Fri, 3 Jan 2014, Jakub Jelinek wrote:

> I think this has been discussed in some PR, unfortunately I can't find it.

Bug 57725?

-- 
Joseph S. Myers
jos...@codesourcery.com
LIMITS_H_TEST and Newlib
Hello,

in gcc/Makefile, there is a test to determine how to set up the GCC provided limits.h.  Here is a collection of the relevant Makefile parts:

#
# Installation directories
#

# Common prefix for installation directories.
# NOTE: This directory must exist when you start installation.
prefix = /the/prefix
# Directory in which to put host dependent programs and libraries
exec_prefix = ${prefix}
# Directory in which to put the directories used by the compiler.
libdir = ${exec_prefix}/lib64
# Directory in which the compiler finds libraries etc.
libsubdir = $(libdir)/gcc/$(target_noncanonical)/$(version)
# Used in install-cross.
gcc_tooldir = $(libsubdir)/$(libsubdir_to_prefix)$(target_noncanonical)
# Default cross SYSTEM_HEADER_DIR, to be overridden by targets.
CROSS_SYSTEM_HEADER_DIR = $(gcc_tooldir)/sys-include
# autoconf sets SYSTEM_HEADER_DIR to one of the above.
# Purge it of unnecessary internal relative paths
# to directories that might not exist yet.
# The sed idiom for this is to repeat the search-and-replace until it doesn't match, using :a ... ta.
# Use single quotes here to avoid nested double- and backquotes, this
# macro is also used in a double-quoted context.
SYSTEM_HEADER_DIR = `echo $(CROSS_SYSTEM_HEADER_DIR) | sed -e :a -e 's,[^/]*/\.\.\/,,' -e ta`

# Test to see whether <limits.h> exists in the system header files.
LIMITS_H_TEST = [ -f $(SYSTEM_HEADER_DIR)/limits.h ]

#
# Build the include directories.  The stamp files are stmp-* rather than
# s-* so that mostlyclean does not force the include directory to
# be rebuilt.

# Build the include directories.
stmp-int-hdrs: $(STMP_FIXINC) $(USER_H) fixinc_list
# Copy in the headers provided with gcc.
#
# The sed command gets just the last file name component;
# this is necessary because VPATH could add a dirname.
# Using basename would be simpler, but some systems don't have it.
#
# The touch command is here to workaround an AIX/Linux NFS bug.
#
# The move-if-change + cp -p twists for limits.h are intended to preserve
# the time stamp when we regenerate, to prevent pointless rebuilds during
# e.g. install-no-fixedincludes.
[...]
        set -e; for ml in `cat fixinc_list`; do \
          sysroot_headers_suffix=`echo $${ml} | sed -e 's/;.*$$//'`; \
          multi_dir=`echo $${ml} | sed -e 's/^[^;]*;//'`; \
          fix_dir=include-fixed$${multi_dir}; \
          if $(LIMITS_H_TEST) ; then \
            cat $(srcdir)/limitx.h $(srcdir)/glimits.h $(srcdir)/limity.h > tmp-xlimits.h; \
          else \
            cat $(srcdir)/glimits.h > tmp-xlimits.h; \
          fi; \
[...]

Since Newlib is normally built as part of the GCC cross compiler build, it makes no sense to use directories of the installation tree for this test.  The installation tree should not affect the build of GCC with Newlib.

For RTEMS there are some hacks to deal with this limits.h problem in "gcc/config/t-rtems" and "libgcc/config/t-rtems", but I think we should get rid of this RTEMS special case solution.

There is already a --with-newlib configure option, so maybe it makes sense to use it for the "stmp-int-hdrs" Makefile target?

If I edit gcc/Makefile like this:

# Default cross SYSTEM_HEADER_DIR, to be overridden by targets.
CROSS_SYSTEM_HEADER_DIR = $(objdir)/../$(target_subdir)/newlib/targ-include

then the right GCC provided limits.h is generated.

diff --git a/gcc/configure.ac b/gcc/configure.ac
index 0023b2a..020d34c 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -1879,6 +1879,7 @@ if { { test x$host != x$target && test "x$with_sysroot" = x ; } ||
        test x$with_newlib = xyes ; } &&
      { test "x$with_headers" = x || test "x$with_headers" = xno ; } ; then
        inhibit_libc=true
+       CROSS_SYSTEM_HEADER_DIR='$(objdir)/../$(target_subdir)/newlib/targ-include'
 fi
 AC_SUBST(inhibit_libc)

Unfortunately this doesn't work, since the "stmp-int-hdrs" Makefile target is built before the includes are copied to the '$(objdir)/../$(target_subdir)/newlib/targ-include' directory :-(

Does anyone know offhand whether it is feasible to change this in the build machinery?

-- 
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax     : +49 89 189 47 41-09
E-Mail  : sebastian.hu...@embedded-brains.de
PGP     : Public key available on request.

This message is not a business communication within the meaning of the EHUG.
Re: LIMITS_H_TEST and Newlib
On Fri, 3 Jan 2014, Sebastian Huber wrote:

> There is already a --with-newlib configure option, so maybe it makes sense to
> use it for the "stmp-int-hdrs" Makefile target?

The --with-newlib option is a badly named option that really means "set
inhibit_libc".  That is, it's for an initial bootstrap compiler build,
whatever C library might be in use (and typically there'd be another
compiler build once that actual C library and headers have been installed
using the first bootstrap compiler, with this second compiler build being
the one that should actually be fully configured for the C library in
use).  So it might not be a good idea to make it do anything specific to
newlib.

-- 
Joseph S. Myers
jos...@codesourcery.com
How to generate AVX512 instructions now (just to look at them).
I am trying to figure out how the top-consuming routines in our weather models will be compiled when using AVX512 instructions (and their 32 512 bit registers).

I thought an up-to-date trunk version of gcc, using the command line:

<...>/gfortran -Ofast -S -mavx2 -mavx512f

would do that.

Unfortunately, I do not see any use of the new zmm.. registers, which might mean that AVX512 isn't used yet.

This is how the nightly build job builds the trunk gfortran compiler:

configure --prefix=/home/toon/compilers/install --with-gnu-as --with-gnu-ld --enable-languages=fortran<,other-language> --disable-multilib --disable-nls --with-arch=core-avx2 --with-tune=core-avx2

Is it the --with-arch=core-avx2?  Or perhaps the --with-gnu-as --with-gnu-ld (because the installed ones do not support AVX512 yet?).

Thanks in advance,

-- 
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
Re: How to generate AVX512 instructions now (just to look at them).
On 1/3/2014 11:04 AM, Toon Moene wrote:
> I am trying to figure out how the top-consuming routines in our weather
> models will be compiled when using AVX512 instructions (and their 32
> 512 bit registers).
>
> I thought an up-to-date trunk version of gcc, using the command line:
>
> <...>/gfortran -Ofast -S -mavx2 -mavx512f
>
> would do that.
>
> Unfortunately, I do not see any use of the new zmm.. registers, which
> might mean that AVX512 isn't used yet.
>
> This is how the nightly build job builds the trunk gfortran compiler:
>
> configure --prefix=/home/toon/compilers/install --with-gnu-as
> --with-gnu-ld --enable-languages=fortran<,other-language>
> --disable-multilib --disable-nls --with-arch=core-avx2
> --with-tune=core-avx2

gfortran -O3 -funroll-loops --param max-unroll-times=2 -ffast-math -mavx512f -fopenmp -S

is giving me extremely limited zmm register usage in my build of gfortran trunk.  It appears to be using zmm only to enable use of vpternlogd instructions.  Immediately following the first such usage, it is failing to vectorize a dot_product with stride 1 operands.  There are still AVX2 scalar instructions and AVX-256 vectorized loops, but none with reduction or fma.

For gcc, I have to add -march=native in order for it to accept fma intrinsics (even though that one is expanded to AVX without fma).

Sorry, my only AVX2 CPU is a Windows 8.1 installation (!).

Target: x86_64-unknown-cygwin
Configured with: ../configure --prefix=/usr/local/gcc4.9/ --enable-languages='c c++ fortran' --enable-libgomp --enable-threads=posix --disable-libmudflap --disable-__cxa_atexit --with-dwarf2 --without-libiconv-prefix --without-libintl-prefix --with-system-zlib

-- 
Tim Prince
Re: How to generate AVX512 instructions now (just to look at them).
On Fri, Jan 03, 2014 at 05:04:55PM +0100, Toon Moene wrote:
> I am trying to figure out how the top-consuming routines in our
> weather models will be compiled when using AVX512 instructions (and
> their 32 512 bit registers).
>
> I thought an up-to-date trunk version of gcc, using the command line:
>
> <...>/gfortran -Ofast -S -mavx2 -mavx512f
>
> would do that.
>
> Unfortunately, I do not see any use of the new zmm.. registers,
> which might mean that AVX512 isn't used yet.
>
> This is how the nightly build job builds the trunk gfortran compiler:
>
> configure --prefix=/home/toon/compilers/install --with-gnu-as
> --with-gnu-ld --enable-languages=fortran<,other-language>
> --disable-multilib --disable-nls --with-arch=core-avx2
> --with-tune=core-avx2
>
> Is it the --with-arch=core-avx2 ? Or perhaps the --with-gnu-as
> --with-gnu-ld (because the installed ones do not support AVX512 yet
> ?).

You shouldn't need an assembler with AVX512 support just for -S.  If I try,
say, this simple example:

void f1 (int *__restrict e, int *__restrict f)
{
  int i;
  for (i = 0; i < 1024; i++)
    e[i] = f[i] * 7;
}

void f2 (int *__restrict e, int *__restrict f)
{
  int i;
  for (i = 0; i < 1024; i++)
    e[i] = f[i];
}

with -O2 -ftree-vectorize -mavx512f, I get:

        vmovdqa64       .LC0(%rip), %zmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        vpmulld (%rsi,%rax), %zmm1, %zmm0
        vmovdqu32       %zmm0, (%rdi,%rax)
        addq    $64, %rax
        cmpq    $4096, %rax
        jne     .L2
        rep; ret

and

        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L6:
        vmovdqu64       (%rsi,%rax), %zmm0
        vmovdqu32       %zmm0, (%rdi,%rax)
        addq    $64, %rax
        cmpq    $4096, %rax
        jne     .L6
        rep; ret

If something hasn't been vectorized, you can look at the
-fdump-tree-vect-details output to see why.

        Jakub
Re: How to generate AVX512 instructions now (just to look at them).
On 01/03/2014 07:04 PM, Jakub Jelinek wrote:
> On Fri, Jan 03, 2014 at 05:04:55PM +0100, Toon Moene wrote:
>> I am trying to figure out how the top-consuming routines in our
>> weather models will be compiled when using AVX512 instructions (and
>> their 32 512 bit registers).
>>
>> I thought an up-to-date trunk version of gcc, using the command line:
>>
>> <...>/gfortran -Ofast -S -mavx2 -mavx512f
>>
>> would do that.
>>
>> Unfortunately, I do not see any use of the new zmm.. registers,
>> which might mean that AVX512 isn't used yet.
>>
>> This is how the nightly build job builds the trunk gfortran compiler:
>>
>> configure --prefix=/home/toon/compilers/install --with-gnu-as
>> --with-gnu-ld --enable-languages=fortran<,other-language>
>> --disable-multilib --disable-nls --with-arch=core-avx2
>> --with-tune=core-avx2
>>
>> Is it the --with-arch=core-avx2 ? Or perhaps the --with-gnu-as
>> --with-gnu-ld (because the installed ones do not support AVX512 yet
>> ?).
>
> You shouldn't need assembler with AVX512 support just for -S, if I try say
> simple:
> void f1 (int *__restrict e, int *__restrict f)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     e[i] = f[i] * 7;
> }

I don't doubt that would work; what I'm interested in is (cat verintlin.f):

      SUBROUTINE VERINT (
     I   KLON   , KLAT   , KLEV   , KINT   , KHALO
     I , KLON1  , KLON2  , KLAT1  , KLAT2
     I , KP     , KQ     , KR
     R , PARG   , PRES
     R , PALFH  , PBETH
     R , PALFA  , PBETA  , PGAMA   )
C
C***
C
C   VERINT - THREE DIMENSIONAL INTERPOLATION
C
C   PURPOSE:
C
C   THREE DIMENSIONAL INTERPOLATION
C
C   INPUT PARAMETERS:
C
C   KLON      NUMBER OF GRIDPOINTS IN X-DIRECTION
C   KLAT      NUMBER OF GRIDPOINTS IN Y-DIRECTION
C   KLEV      NUMBER OF VERTICAL LEVELS
C   KINT      TYPE OF INTERPOLATION
C             = 1 - LINEAR
C             = 2 - QUADRATIC
C             = 3 - CUBIC
C             = 4 - MIXED CUBIC/LINEAR
C   KLON1     FIRST GRIDPOINT IN X-DIRECTION
C   KLON2     LAST GRIDPOINT IN X-DIRECTION
C   KLAT1     FIRST GRIDPOINT IN Y-DIRECTION
C   KLAT2     LAST GRIDPOINT IN Y-DIRECTION
C   KP        ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C   KQ        ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C   KR        ARRAY OF INDEXES FOR VERTICAL DISPLACEMENTS
C   PARG      ARRAY OF ARGUMENTS
C   PALFH     ALFA HAT
C   PBETH     BETA HAT
C   PALFA     ARRAY OF WEIGHTS IN X-DIRECTION
C   PBETA     ARRAY OF WEIGHTS IN Y-DIRECTION
C   PGAMA     ARRAY OF WEIGHTS IN VERTICAL DIRECTION
C
C   OUTPUT PARAMETERS:
C
C   PRES      INTERPOLATED FIELD
C
C   HISTORY:
C
C   J.E. HAUGEN 1 1992
C
C***
C
      IMPLICIT NONE
C
      INTEGER KLON   , KLAT   , KLEV   , KINT   , KHALO,
     I        KLON1  , KLON2  , KLAT1  , KLAT2
C
      INTEGER   KP(KLON,KLAT), KQ(KLON,KLAT), KR(KLON,KLAT)
      REAL    PARG(2-KHALO:KLON+KHALO-1,2-KHALO:KLAT+KHALO-1,KLEV)  ,
     R        PRES(KLON,KLAT)     ,
     R       PALFH(KLON,KLAT)     ,  PBETH(KLON,KLAT)  ,
     R       PALFA(KLON,KLAT,4)   ,  PBETA(KLON,KLAT,4),
     R       PGAMA(KLON,KLAT,4)
C
      INTEGER JX, JY, IDX, IDY, ILEV
      REAL Z1MAH, Z1MBH
C
C   LINEAR INTERPOLATION
C
      DO JY = KLAT1,KLAT2
      DO JX = KLON1,KLON2
         IDX  = KP(JX,JY)
         IDY  = KQ(JX,JY)
         ILEV = KR(JX,JY)
C
         PRES(JX,JY) = PGAMA(JX,JY,1)*(
C
     +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
     +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1) )
     + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
     +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV-1) ) )
C+
     + + PGAMA(JX,JY,2)*(
C+
     +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV  )
     +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV  ) )
     + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV  )
     +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV  ) ) )
      ENDDO
      ENDDO
C
      RETURN
      END

i.e., real Fortran code, not just intrinsics :-)

Thanks,

-- 
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
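For experimenting with the vectorizer on something closer to this loop than a unit-stride copy, a reduced C analogue may be handy. This is a hypothetical one-dimensional simplification, not code from the thread: the names interp_gather, idx, w1, w2, src and dst are made up, and only the key feature of VERINT is kept, namely that the operands are loaded through an index array (as with KP, KQ and KR) before being combined with per-point weights.

```c
#include <stddef.h>

/* Hypothetical reduced analogue of the VERINT inner loop: each output
   point is a weighted sum of two neighbouring source values whose
   position comes from an index array, so the loads are gathers rather
   than unit-stride accesses.  Every idx[i] must be at least 1 and a
   valid index into src. */
void interp_gather(size_t n,
                   const int *restrict idx,
                   const float *restrict w1,
                   const float *restrict w2,
                   const float *restrict src,
                   float *restrict dst)
{
    for (size_t i = 0; i < n; i++) {
        int k = idx[i];
        dst[i] = w1[i] * src[k - 1] + w2[i] * src[k];
    }
}
```

Compiling it with the options already mentioned in the thread (for example -O2 -ftree-vectorize -mavx512f -S, plus -fdump-tree-vect-details) is one way to check how a given compiler build treats indexed loads of this kind.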
Re: How to generate AVX512 instructions now (just to look at them).
On 1/3/2014 2:58 PM, Toon Moene wrote:
> On 01/03/2014 07:04 PM, Jakub Jelinek wrote:
>> On Fri, Jan 03, 2014 at 05:04:55PM +0100, Toon Moene wrote:
>>> I am trying to figure out how the top-consuming routines in our
>>> weather models will be compiled when using AVX512 instructions (and
>>> their 32 512 bit registers).
>
> what I'm interested in, is (cat verintlin.f):
>
>       SUBROUTINE VERINT (
>      I   KLON   , KLAT   , KLEV   , KINT   , KHALO
>      I , KLON1  , KLON2  , KLAT1  , KLAT2
>      I , KP     , KQ     , KR
>      R , PARG   , PRES
>      R , PALFH  , PBETH
>      R , PALFA  , PBETA  , PGAMA   )
> C
> C***
> C
> C   VERINT - THREE DIMENSIONAL INTERPOLATION
> C
> C   PURPOSE:
> C
> C   THREE DIMENSIONAL INTERPOLATION
> C
> C   INPUT PARAMETERS:
> C
> C   KLON      NUMBER OF GRIDPOINTS IN X-DIRECTION
> C   KLAT      NUMBER OF GRIDPOINTS IN Y-DIRECTION
> C   KLEV      NUMBER OF VERTICAL LEVELS
> C   KINT      TYPE OF INTERPOLATION
> C             = 1 - LINEAR
> C             = 2 - QUADRATIC
> C             = 3 - CUBIC
> C             = 4 - MIXED CUBIC/LINEAR
> C   KLON1     FIRST GRIDPOINT IN X-DIRECTION
> C   KLON2     LAST GRIDPOINT IN X-DIRECTION
> C   KLAT1     FIRST GRIDPOINT IN Y-DIRECTION
> C   KLAT2     LAST GRIDPOINT IN Y-DIRECTION
> C   KP        ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
> C   KQ        ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
> C   KR        ARRAY OF INDEXES FOR VERTICAL DISPLACEMENTS
> C   PARG      ARRAY OF ARGUMENTS
> C   PALFH     ALFA HAT
> C   PBETH     BETA HAT
> C   PALFA     ARRAY OF WEIGHTS IN X-DIRECTION
> C   PBETA     ARRAY OF WEIGHTS IN Y-DIRECTION
> C   PGAMA     ARRAY OF WEIGHTS IN VERTICAL DIRECTION
> C
> C   OUTPUT PARAMETERS:
> C
> C   PRES      INTERPOLATED FIELD
> C
> C   HISTORY:
> C
> C   J.E. HAUGEN 1 1992
> C
> C***
> C
>       IMPLICIT NONE
> C
>       INTEGER KLON   , KLAT   , KLEV   , KINT   , KHALO,
>      I        KLON1  , KLON2  , KLAT1  , KLAT2
> C
>       INTEGER   KP(KLON,KLAT), KQ(KLON,KLAT), KR(KLON,KLAT)
>       REAL    PARG(2-KHALO:KLON+KHALO-1,2-KHALO:KLAT+KHALO-1,KLEV)  ,
>      R        PRES(KLON,KLAT)     ,
>      R       PALFH(KLON,KLAT)     ,  PBETH(KLON,KLAT)  ,
>      R       PALFA(KLON,KLAT,4)   ,  PBETA(KLON,KLAT,4),
>      R       PGAMA(KLON,KLAT,4)
> C
>       INTEGER JX, JY, IDX, IDY, ILEV
>       REAL Z1MAH, Z1MBH
> C
> C   LINEAR INTERPOLATION
> C
>       DO JY = KLAT1,KLAT2
>       DO JX = KLON1,KLON2
>          IDX  = KP(JX,JY)
>          IDY  = KQ(JX,JY)
>          ILEV = KR(JX,JY)
> C
>          PRES(JX,JY) = PGAMA(JX,JY,1)*(
> C
>      +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
>      +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1) )
>      + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
>      +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV-1) ) )
> C+
>      + + PGAMA(JX,JY,2)*(
> C+
>      +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV  )
>      +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV  ) )
>      + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV  )
>      +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV  ) ) )
>       ENDDO
>       ENDDO
> C
>       RETURN
>       END
>
> i.e., real Fortran code, not just intrinsics :-)

Right out of the AVX512 architect's dream.  It appears to need 24 AVX-512 registers in the ifort compilation (/arch:MIC-AVX512) to avoid those spills and repeated memory operands in the gfortran avx2 compilation.  How small a ratio of floating point to total instructions can you call "real Fortran?"

-- 
Tim Prince