Undefined behavior or gcc is doing additional good job?

2014-01-03 Thread Bin.Cheng
Hi, for the simple example below:
#include <stdint.h>

extern uint32_t __bss_start[];
extern uint32_t __data_start[];

void Reset_Handler(void)
{
 /* Clear .bss section (initialize with zeros) */
 for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) {
  *bss_ptr = 0;
 }
}

One snapshot of our branch generates:
Reset_Handler:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    ldr     r2, .L6
    ldr     r1, .L6+4
    subs    r1, r1, r2
    bic     r1, r1, #3
    movs    r3, #0
.L2:
    cmp     r3, r1
    beq     .L5
    movs    r0, #0
    str     r0, [r2, r3]
    adds    r3, r3, #4
    b       .L2
.L5:
    bx      lr
.L7:
    .align  2
.L6:
    .word   __bss_start
    .word   __data_start
    .size   Reset_Handler, .-Reset_Handler

I know IVOPT chooses the wrong candidate here, but what I am not sure about is:
0) The original code is not safe. It could result in an infinite loop if
there is any alignment issue between __bss_start and __data_start.
1) GCC explicitly clears the two lower bits of (__data_start -
__bss_start). This makes the loop safe (it cannot run forever).

My question is: is GCC intended to do such a transformation?
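
To make points 0) and 1) concrete, here is a small C model of the two loop shapes. It is only a sketch, not GCC output: the function names and the `cap` parameter are illustrative assumptions, and byte addresses are modeled as plain integers so the example can run on a host machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the source loop on byte addresses: step by 4 until the
   address equals `end`.  `cap` (hypothetical) bounds the walk so that
   a loop whose exit test never fires can be observed without hanging. */
size_t steps_exact_compare(uintptr_t start, uintptr_t end, size_t cap)
{
    size_t n = 0;
    for (uintptr_t p = start; p != end && n < cap; p += 4)
        ++n;
    return n;                 /* n == cap: the exit test never fired */
}

/* Model of the emitted loop: the byte offset (r3) runs from 0 up to
   (end - start) with its two low bits cleared (the `bic r1, r1, #3`). */
size_t steps_masked_bound(uintptr_t start, uintptr_t end, size_t cap)
{
    assert(end >= start);     /* the model only covers an ordered pair */
    uintptr_t bound = (end - start) & ~(uintptr_t)3;
    size_t n = 0;
    for (uintptr_t off = 0; off != bound && n < cap; off += 4)
        ++n;
    return n;
}
```

With a distance of 0x10 bytes both functions report 4 steps. With a distance of 0x12 bytes the first one runs into the cap, while the second still stops after 4 steps.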

Thanks,
bin
-- 
Best Regards.


Re: Undefined behavior or gcc is doing additional good job?

2014-01-03 Thread Jakub Jelinek
On Fri, Jan 03, 2014 at 04:12:19PM +0800, Bin.Cheng wrote:
> Hi, For below simple example:
> #include <stdint.h>
> 
> extern uint32_t __bss_start[];
> extern uint32_t __data_start[];
> 
> void Reset_Handler(void)
> {
>  /* Clear .bss section (initialize with zeros) */
>  for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) {
>   *bss_ptr = 0;
>  }
> }

I believe this is undefined behavior, so GCC can assume
bss_ptr != __data_start is always true.  You need something like
memset (__bss_start, 0, (uintptr_t) __data_start - (uintptr_t) __bss_start);
(note the casts to non-pointers), then it is just implementation-defined
behavior.  Or do
uint32_t *data_ptr;
asm ("" : "=g" (data_ptr) : "0" (__data_start));
for (uint32_t* bss_ptr = __bss_start; bss_ptr != data_ptr; ++bss_ptr) {
  *bss_ptr = 0;
}
and thus hide from the compiler the fact that __data_start is in a different
object.
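
A minimal sketch of the memset variant, wrapped in a helper so it can be tried on ordinary buffers. The name `clear_region` is a hypothetical one chosen for illustration, and the sketch assumes a `memset` is available in the startup environment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Zero the bytes in [start, end).  The length is computed on integer
   values, as in the memset suggestion above, so no pointer subtraction
   or pointer comparison across distinct objects takes place. */
void clear_region(void *start, void *end)
{
    uintptr_t lo = (uintptr_t)start;
    uintptr_t hi = (uintptr_t)end;
    assert(hi >= lo);                 /* caller must pass an ordered pair */
    memset(start, 0, (size_t)(hi - lo));
}
```

In the reset handler this would be called as `clear_region(__bss_start, __data_start);`, under the assumption that the linker script places the two symbols in that order.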

Jakub


Re: Undefined behavior or gcc is doing additional good job?

2014-01-03 Thread Bin.Cheng
On Fri, Jan 3, 2014 at 4:24 PM, Jakub Jelinek  wrote:
> On Fri, Jan 03, 2014 at 04:12:19PM +0800, Bin.Cheng wrote:
>> Hi, For below simple example:
>> #include <stdint.h>
>>
>> extern uint32_t __bss_start[];
>> extern uint32_t __data_start[];
>>
>> void Reset_Handler(void)
>> {
>>  /* Clear .bss section (initialize with zeros) */
>>  for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) {
>>   *bss_ptr = 0;
>>  }
>> }
>
> I believe this is undefined behavior, so GCC can assume
> bss_ptr != __data_start is true always.  You need something like
Sorry for posting the question prematurely.  Since both __bss_start and
__data_start are declared as arrays, it seems there is no undefined
behavior; the check is actually between two pointers of the same type,
right?  So the question remains: why would GCC clear the two lower
bits of "__data_start - __bss_start" then?  Am I making some stupid mistake?

Thanks,
bin

> memset (__bss_start, 0, (uintptr_t) __data_start - (uintptr_t) __bss_start);
> (note the casts to non-pointers), then it is just implementation-defined
> behavior.  Or do
> uint32_t *data_ptr;
> asm ("" : "=g" (data_ptr) : "0" (__data_start));
> for (uint32_t* bss_ptr = __bss_start; bss_ptr != data_ptr; ++bss_ptr) {
>   *bss_ptr = 0;
> }
> and thus hide from the compiler the fact that __data_start is in a different
> object.
>
> Jakub



-- 
Best Regards.


Re: Undefined behavior or gcc is doing additional good job?

2014-01-03 Thread Jakub Jelinek
On Fri, Jan 03, 2014 at 04:44:48PM +0800, Bin.Cheng wrote:
> >> extern uint32_t __bss_start[];
> >> extern uint32_t __data_start[];
> >>
> >> void Reset_Handler(void)
> >> {
> >>  /* Clear .bss section (initialize with zeros) */
> >>  for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) 
> >> {
> >>   *bss_ptr = 0;
> >>  }
> >> }
> >
> > I believe this is undefined behavior, so GCC can assume
> > bss_ptr != __data_start is true always.  You need something like
> Sorry for posting the premature question.  Since both __bss_start and
> __data_start are declared as array, it seems there is no undefined
> behavior, the check is between two pointers with same type actually,

I think this has been discussed in some PR; unfortunately I can't find it.
If it were < or <=, then it would be obvious undefined behavior, since those
comparisons can't be performed between different objects.  The above is
questionable, because you still assume that pointer arithmetic takes you
from one object to another.  Without a dereference, pointer arithmetic can
reach one past the last entry in the array, but whether that pointer compares
equal to the other object is still quite problematic.

> right?  So the question remains, why GCC would clear the two lower
> bits of " __data_start - __bss_start" then?  Am I some stupid mistake?

That said, if either __bss_start or __data_start isn't 32-bit aligned,
then it is clearly undefined behavior.  The masking of the low 2 bits (which
doesn't happen on x86_64) comes from IVopts computing the end as
((__data_start - __bss_start) + 1) * 4, where __data_start - __bss_start
is an exact division by 4; apparently we don't fold that back to just
(char *) __data_start - (char *) __bss_start + 4.
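
The arithmetic behind the masking can be sketched in C. This is an illustration under assumptions, not a description of the IVopts implementation: it works on a non-negative byte distance, leaves out the `+ 1` term, and uses made-up function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element difference (byte distance divided by 4) scaled back by 4. */
uintptr_t scaled_element_distance(uintptr_t byte_distance)
{
    uintptr_t elements = byte_distance / sizeof(uint32_t);   /* d / 4 */
    return elements * sizeof(uint32_t);                      /* * 4   */
}

/* The same value written as a mask, as `bic r1, r1, #3` computes it. */
uintptr_t masked_distance(uintptr_t byte_distance)
{
    return byte_distance & ~(uintptr_t)3;
}

/* Check that both forms agree for every distance below `limit`. */
void check_forms_agree(uintptr_t limit)
{
    for (uintptr_t d = 0; d < limit; ++d)
        assert(scaled_element_distance(d) == masked_distance(d));
}
```

When the distance is already a multiple of 4, both forms return it unchanged, which is why folding back to a plain byte difference would be valid for aligned symbols.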

Jakub


Why __builtin_sqrt do not totally replace sqrt in asm

2014-01-03 Thread BELBACHIR Selim
Hi,

When the standard pattern 'sqrtm2' is defined, I don't understand why calls to 
sqrt or __builtin_sqrt are always followed by a comparison of the result with 
itself (checking for NaN?) and a conditional branch to the sqrt symbol (so linking 
with libm is always mandatory).

-
    mov $FP0,$FP1
    fsqrt $FP0, $FP0       << the builtin_sqrt
    fcompare $FP0,$FP0     << strange compare of the result of builtin_sqrt
    jmp.ifEQUAL .L2
    mov $FP1,$FP0
    branch sqrt            << branch to sqrt symbol if $FP0 != $FP0
.L2
-

Is there a way to tell GCC that the sqrt function is fully handled by 
__builtin_sqrt?

Regards,

Selim





Re: Why __builtin_sqrt do not totally replace sqrt in asm

2014-01-03 Thread Jakub Jelinek
On Fri, Jan 03, 2014 at 10:44:21AM +0100, BELBACHIR Selim wrote:
> When the standard pattern 'sqrtm2' is defined I don't understand why calls
> to sqrt or __builtin_sqrt is always followed by a comparison of the result
> with itself (checking the NaN ?) and a conditional branch to sqrt symbol
> (so linking with libm is always mandatory).

Because -fmath-errno is the default and sqrt of a negative value (including
-Inf) is supposed to set errno.  Use -ffast-math or -fno-math-errno if you
don't need/want that.  On some targets GCC is able to emit code to set errno
directly; on others GCC just emits a call to the library function so that it
handles errno properly.
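
The shape of that expansion can be modeled in plain C. This is a sketch under assumptions: `hw_sqrt` and `lib_sqrt` are hypothetical stand-ins for the target's sqrt instruction and for the libm routine, and Newton's method is used only so the example does not depend on libm.

```c
#include <errno.h>

/* Hypothetical stand-in for the target's sqrt instruction: yields NaN
   for negative input and never touches errno. */
double hw_sqrt(double x)
{
    if (x < 0.0)
        return 0.0 / 0.0;                 /* NaN */
    if (x == 0.0)
        return x;
    double y = x > 1.0 ? x : 1.0;         /* start at or above the root */
    for (int i = 0; i < 2000; ++i) {
        double next = 0.5 * (y + x / y);  /* Newton step */
        if (next == y)
            break;
        y = next;
    }
    return y;
}

/* Hypothetical stand-in for the library routine: same value, but it
   also reports the domain error through errno. */
double lib_sqrt(double x)
{
    if (x < 0.0)
        errno = EDOM;
    return hw_sqrt(x);
}

/* Fast path first; the library is called only when the result is NaN,
   which is what the self-comparison (r == r) detects. */
double sqrt_with_errno(double x)
{
    double r = hw_sqrt(x);
    if (r == r)
        return r;
    return lib_sqrt(x);
}
```

For an argument such as 4.0 the library path is never taken and errno stays untouched; for -1.0 the result is NaN and errno becomes EDOM. With -fno-math-errno only the fast path would remain.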

Jakub


lto testsuite may erase mathlib variable

2014-01-03 Thread BELBACHIR Selim
Hi,

I noticed a problem in gcc/testsuite/g++.dg/lto/lto.exp

If the target does not support LTO (check_effective_target_lto), the script 
returns early, so the mathlib variable modified in lto_init is not restored 
by lto_finish at the end of the script.

Subsequent testsuites will find an empty mathlib.

Regards,

Selim




patch
Description: patch


Re: Undefined behavior or gcc is doing additional good job?

2014-01-03 Thread Joseph S. Myers
On Fri, 3 Jan 2014, Jakub Jelinek wrote:

> I think this has been discussed in some PR, unfortunately I can't find it.

Bug 57725?

-- 
Joseph S. Myers
jos...@codesourcery.com


LIMITS_H_TEST and Newlib

2014-01-03 Thread Sebastian Huber

Hello,

in gcc/Makefile, there is a test to determine how to set up the GCC 
provided limits.h.  Here is a collection of the relevant Makefile parts:


# 
# Installation directories
# 

# Common prefix for installation directories.
# NOTE: This directory must exist when you start installation.
prefix = /the/prefix

# Directory in which to put host dependent programs and libraries
exec_prefix = ${prefix}

# Directory in which to put the directories used by the compiler.
libdir = ${exec_prefix}/lib64

# Directory in which the compiler finds libraries etc.
libsubdir = $(libdir)/gcc/$(target_noncanonical)/$(version)

# Used in install-cross.
gcc_tooldir = $(libsubdir)/$(libsubdir_to_prefix)$(target_noncanonical)

# Default cross SYSTEM_HEADER_DIR, to be overridden by targets.
CROSS_SYSTEM_HEADER_DIR = $(gcc_tooldir)/sys-include

# autoconf sets SYSTEM_HEADER_DIR to one of the above.
# Purge it of unnecessary internal relative paths
# to directories that might not exist yet.
# The sed idiom for this is to repeat the search-and-replace until it
# doesn't match, using :a ... ta.

# Use single quotes here to avoid nested double- and backquotes, this
# macro is also used in a double-quoted context.
SYSTEM_HEADER_DIR = `echo $(CROSS_SYSTEM_HEADER_DIR) | sed -e :a -e 's,[^/]*/\.\.\/,,' -e ta`


# Test to see whether <limits.h> exists in the system header files.
LIMITS_H_TEST = [ -f $(SYSTEM_HEADER_DIR)/limits.h ]

#
# Build the include directories.  The stamp files are stmp-* rather than
# s-* so that mostlyclean does not force the include directory to
# be rebuilt.

# Build the include directories.
stmp-int-hdrs: $(STMP_FIXINC) $(USER_H) fixinc_list
# Copy in the headers provided with gcc.
#
# The sed command gets just the last file name component;
# this is necessary because VPATH could add a dirname.
# Using basename would be simpler, but some systems don't have it.
#
# The touch command is here to workaround an AIX/Linux NFS bug.
#
# The move-if-change + cp -p twists for limits.h are intended to preserve
# the time stamp when we regenerate, to prevent pointless rebuilds during
# e.g. install-no-fixedincludes.
[...]
set -e; for ml in `cat fixinc_list`; do \
  sysroot_headers_suffix=`echo $${ml} | sed -e 's/;.*$$//'`; \
  multi_dir=`echo $${ml} | sed -e 's/^[^;]*;//'`; \
  fix_dir=include-fixed$${multi_dir}; \
  if $(LIMITS_H_TEST) ; then \
    cat $(srcdir)/limitx.h $(srcdir)/glimits.h $(srcdir)/limity.h > tmp-xlimits.h; \
  else \
    cat $(srcdir)/glimits.h > tmp-xlimits.h; \
  fi; \
[...]

Since Newlib is normally built as part of the GCC cross compiler build 
it makes no sense to use directories of the installation tree for this 
test.  The installation tree should not affect the build of GCC with Newlib.


For RTEMS there are some hacks to deal with this limits.h problem in 
"gcc/config/t-rtems" and "libgcc/config/t-rtems", but I think we should 
get rid of this RTEMS special case solution.


There is already a --with-newlib configure option, so maybe it makes 
sense to use it for the "stmp-int-hdrs" Makefile target?


If I edit gcc/Makefile to use

# Default cross SYSTEM_HEADER_DIR, to be overridden by targets.
CROSS_SYSTEM_HEADER_DIR = $(objdir)/../$(target_subdir)/newlib/targ-include

then the right GCC-provided limits.h is generated.

diff --git a/gcc/configure.ac b/gcc/configure.ac
index 0023b2a..020d34c 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -1879,6 +1879,7 @@
 if { { test x$host != x$target && test "x$with_sysroot" = x ; } ||
        test x$with_newlib = xyes ; } &&
      { test "x$with_headers" = x || test "x$with_headers" = xno ; } ; then
        inhibit_libc=true
+       CROSS_SYSTEM_HEADER_DIR='$(objdir)/../$(target_subdir)/newlib/targ-include'
 fi
 AC_SUBST(inhibit_libc)

Unfortunately this doesn't work, since the "stmp-int-hdrs" Makefile 
target is built before the includes are copied to the 
'$(objdir)/../$(target_subdir)/newlib/targ-include' directory :-(


Does anyone know offhand whether it is feasible to change this in the build 
machinery?


--
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail  : sebastian.hu...@embedded-brains.de
PGP : Public key available on request.

This message is not a business communication within the meaning of the EHUG.



Re: LIMITS_H_TEST and Newlib

2014-01-03 Thread Joseph S. Myers
On Fri, 3 Jan 2014, Sebastian Huber wrote:

> There is already a --with-newlib configure option, so maybe it makes sense to
> use it for the "stmp-int-hdrs" Makefile target?

The --with-newlib option is a badly named option that really means "set 
inhibit_libc".  That is, it's for an initial bootstrap compiler build, 
whatever C library might be in use (and typically there'd be another 
compiler build once that actual C library and headers have been installed 
using the first bootstrap compiler, with this second compiler build being 
the one that should actually be fully configured for the C library in 
use).  So it might not be a good idea to make it do anything specific to 
newlib.

-- 
Joseph S. Myers
jos...@codesourcery.com


How to generate AVX512 instructions now (just to look at them).

2014-01-03 Thread Toon Moene
I am trying to figure out how the top-consuming routines in our weather 
models will be compiled when using AVX512 instructions (and their 32 512 
bit registers).


I thought an up-to-date trunk version of gcc, using the command line:

<...>/gfortran -Ofast -S -mavx2 -mavx512f 

would do that.

Unfortunately, I do not see any use of the new zmm.. registers, which 
might mean that AVX512 isn't used yet.


This is how the nightly build job builds the trunk gfortran compiler:

configure --prefix=/home/toon/compilers/install --with-gnu-as 
--with-gnu-ld --enable-languages=fortran<,other-language> 
--disable-multilib --disable-nls --with-arch=core-avx2 --with-tune=core-avx2


Is it the --with-arch=core-avx2 ? Or perhaps the --with-gnu-as 
--with-gnu-ld (because the installed ones do not support AVX512 yet ?).


Thanks in advance,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news


Re: How to generate AVX512 instructions now (just to look at them).

2014-01-03 Thread Tim Prince


On 1/3/2014 11:04 AM, Toon Moene wrote:
> I am trying to figure out how the top-consuming routines in our
> weather models will be compiled when using AVX512 instructions (and
> their 32 512 bit registers).
>
> I thought an up-to-date trunk version of gcc, using the command line:
>
> <...>/gfortran -Ofast -S -mavx2 -mavx512f
>
> would do that.
>
> Unfortunately, I do not see any use of the new zmm.. registers, which
> might mean that AVX512 isn't used yet.
>
> This is how the nightly build job builds the trunk gfortran compiler:
>
> configure --prefix=/home/toon/compilers/install --with-gnu-as
> --with-gnu-ld --enable-languages=fortran<,other-language>
> --disable-multilib --disable-nls --with-arch=core-avx2
> --with-tune=core-avx2

gfortran -O3  -funroll-loops --param max-unroll-times=2 -ffast-math 
-mavx512f -fopenmp -S
is giving me extremely limited zmm register usage in my build of 
gfortran trunk.  It appears to be using zmm only to enable use of 
vpternlogd instructions.  Immediately following the first such usage, it 
is failing to vectorize a dot_product with stride 1 operands.  There are 
still AVX2 scalar instructions and AVX-256 vectorized loops, but none 
with reduction or fma.
For gcc, I have to add -march=native in order for it to accept fma 
intrinsics (even though that one is expanded to AVX without fma).

Sorry, my only AVX2 CPU is a Windows 8.1 installation (!).

Target: x86_64-unknown-cygwin
Configured with: ../configure --prefix=/usr/local/gcc4.9/
--enable-languages='c c++ fortran' --enable-libgomp --enable-threads=posix
--disable-libmudflap --disable-__cxa_atexit --with-dwarf2
--without-libiconv-prefix --without-libintl-prefix --with-system-zlib

--
Tim Prince



Re: How to generate AVX512 instructions now (just to look at them).

2014-01-03 Thread Jakub Jelinek
On Fri, Jan 03, 2014 at 05:04:55PM +0100, Toon Moene wrote:
> I am trying to figure out how the top-consuming routines in our
> weather models will be compiled when using AVX512 instructions (and
> their 32 512 bit registers).
> 
> I thought an up-to-date trunk version of gcc, using the command line:
> 
> <...>/gfortran -Ofast -S -mavx2 -mavx512f 
> 
> would do that.
> 
> Unfortunately, I do not see any use of the new zmm.. registers,
> which might mean that AVX512 isn't used yet.
> 
> This is how the nightly build job builds the trunk gfortran compiler:
> 
> configure --prefix=/home/toon/compilers/install --with-gnu-as
> --with-gnu-ld --enable-languages=fortran<,other-language>
> --disable-multilib --disable-nls --with-arch=core-avx2
> --with-tune=core-avx2
> 
> Is it the --with-arch=core-avx2 ? Or perhaps the --with-gnu-as
> --with-gnu-ld (because the installed ones do not support AVX512 yet
> ?).

You shouldn't need an assembler with AVX512 support just for -S.
If I try, say, this simple example:
void f1 (int *__restrict e, int *__restrict f) { int i; for (i = 0; i < 1024; i++) e[i] = f[i] * 7; }
void f2 (int *__restrict e, int *__restrict f) { int i; for (i = 0; i < 1024; i++) e[i] = f[i]; }
with -O2 -ftree-vectorize -mavx512f, I get:
        vmovdqa64       .LC0(%rip), %zmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        vpmulld (%rsi,%rax), %zmm1, %zmm0
        vmovdqu32       %zmm0, (%rdi,%rax)
        addq    $64, %rax
        cmpq    $4096, %rax
        jne     .L2
        rep; ret
and
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L6:
        vmovdqu64       (%rsi,%rax), %zmm0
        vmovdqu32       %zmm0, (%rdi,%rax)
        addq    $64, %rax
        cmpq    $4096, %rax
        jne     .L6
        rep; ret

If something hasn't been vectorized, -fdump-tree-vect-details will show
why.

Jakub


Re: How to generate AVX512 instructions now (just to look at them).

2014-01-03 Thread Toon Moene

On 01/03/2014 07:04 PM, Jakub Jelinek wrote:

> On Fri, Jan 03, 2014 at 05:04:55PM +0100, Toon Moene wrote:
>
>> I am trying to figure out how the top-consuming routines in our
>> weather models will be compiled when using AVX512 instructions (and
>> their 32 512 bit registers).
>>
>> I thought an up-to-date trunk version of gcc, using the command line:
>>
>> <...>/gfortran -Ofast -S -mavx2 -mavx512f
>>
>> would do that.
>>
>> Unfortunately, I do not see any use of the new zmm.. registers,
>> which might mean that AVX512 isn't used yet.
>>
>> This is how the nightly build job builds the trunk gfortran compiler:
>>
>> configure --prefix=/home/toon/compilers/install --with-gnu-as
>> --with-gnu-ld --enable-languages=fortran<,other-language>
>> --disable-multilib --disable-nls --with-arch=core-avx2
>> --with-tune=core-avx2
>>
>> Is it the --with-arch=core-avx2 ? Or perhaps the --with-gnu-as
>> --with-gnu-ld (because the installed ones do not support AVX512 yet
>> ?).
>
> You shouldn't need assembler with AVX512 support just for -S,
> if I try say simple:
> void f1 (int *__restrict e, int *__restrict f) { int i; for (i = 0; i < 1024; i++) e[i] = f[i] * 7; }


I don't doubt that would work, what I'm interested in, is (cat verintlin.f):

      SUBROUTINE VERINT (
     I   KLON   , KLAT   , KLEV   , KINT  , KHALO
     I , KLON1  , KLON2  , KLAT1  , KLAT2
     I , KP     , KQ     , KR
     R , PARG   , PRES
     R , PALFH  , PBETH
     R , PALFA  , PBETA  , PGAMA   )
C
C***
C
C  VERINT - THREE DIMENSIONAL INTERPOLATION
C
C  PURPOSE:
C
C  THREE DIMENSIONAL INTERPOLATION
C
C  INPUT PARAMETERS:
C
C  KLON      NUMBER OF GRIDPOINTS IN X-DIRECTION
C  KLAT      NUMBER OF GRIDPOINTS IN Y-DIRECTION
C  KLEV      NUMBER OF VERTICAL LEVELS
C  KINT      TYPE OF INTERPOLATION
C            = 1 - LINEAR
C            = 2 - QUADRATIC
C            = 3 - CUBIC
C            = 4 - MIXED CUBIC/LINEAR
C  KLON1     FIRST GRIDPOINT IN X-DIRECTION
C  KLON2     LAST  GRIDPOINT IN X-DIRECTION
C  KLAT1     FIRST GRIDPOINT IN Y-DIRECTION
C  KLAT2     LAST  GRIDPOINT IN Y-DIRECTION
C  KP        ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C  KQ        ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS
C  KR        ARRAY OF INDEXES FOR VERTICAL   DISPLACEMENTS
C  PARG      ARRAY OF ARGUMENTS
C  PALFH     ALFA HAT
C  PBETH     BETA HAT
C  PALFA     ARRAY OF WEIGHTS IN X-DIRECTION
C  PBETA     ARRAY OF WEIGHTS IN Y-DIRECTION
C  PGAMA     ARRAY OF WEIGHTS IN VERTICAL DIRECTION
C
C  OUTPUT PARAMETERS:
C
C  PRES      INTERPOLATED FIELD
C
C  HISTORY:
C
C  J.E. HAUGEN   1  1992
C
C***
C
      IMPLICIT NONE
C
      INTEGER KLON   , KLAT   , KLEV   , KINT   , KHALO,
     I        KLON1  , KLON2  , KLAT1  , KLAT2
C
      INTEGER   KP(KLON,KLAT), KQ(KLON,KLAT), KR(KLON,KLAT)
      REAL    PARG(2-KHALO:KLON+KHALO-1,2-KHALO:KLAT+KHALO-1,KLEV)  ,
     R        PRES(KLON,KLAT) ,
     R   PALFH(KLON,KLAT) ,  PBETH(KLON,KLAT)  ,
     R   PALFA(KLON,KLAT,4)   ,  PBETA(KLON,KLAT,4),
     R   PGAMA(KLON,KLAT,4)
C
      INTEGER JX, JY, IDX, IDY, ILEV
      REAL Z1MAH, Z1MBH
C
C  LINEAR INTERPOLATION
C
      DO JY = KLAT1,KLAT2
      DO JX = KLON1,KLON2
         IDX  = KP(JX,JY)
         IDY  = KQ(JX,JY)
         ILEV = KR(JX,JY)
C
         PRES(JX,JY) = PGAMA(JX,JY,1)*(
C
     +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1)
     +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV-1) )
     + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV-1)
     +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV-1) ) )
C    +
     +   + PGAMA(JX,JY,2)*(
C    +
     +   PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV  )
     +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY-1,ILEV  ) )
     + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY  ,ILEV  )
     +                  + PALFA(JX,JY,2)*PARG(IDX  ,IDY  ,ILEV  ) ) )
      ENDDO
      ENDDO
C
      RETURN
      END

i.e., real Fortran code, not just intrinsics :-)
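
The loop above fetches PARG through indices read from KP, KQ and KR, so every iteration addresses memory indirectly. A much reduced C analogue shows that access pattern in isolation. It is a sketch only: the flattened one-dimensional field, the single index array, and all names in it are simplifying assumptions, not the VERINT data layout.

```c
#include <stddef.h>

/* Reduced analogue of the VERINT inner loop: a two-point weighted sum
   whose operands are fetched through an index array.  `field`, `idx`,
   `w0`, `w1` and `out` are illustrative names; each idx[i] must be at
   least 1 and a valid index into `field`. */
void interp_indirect(size_t n, const float *field, const int *idx,
                     const float *w0, const float *w1, float *out)
{
    for (size_t i = 0; i < n; ++i) {
        int k = idx[i];                   /* like IDX = KP(JX,JY)   */
        out[i] = w0[i] * field[k - 1]     /* like PARG(IDX-1, ...)  */
               + w1[i] * field[k];        /* like PARG(IDX  , ...)  */
    }
}
```

Compiling such a function with the options mentioned earlier in the thread (-O2 -ftree-vectorize -mavx512f -S, plus -fdump-tree-vect-details) is one way to see how a given compiler build handles the indexed loads; the sketch makes no claim about what any particular build emits.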

Thanks,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news


Re: How to generate AVX512 instructions now (just to look at them).

2014-01-03 Thread Tim Prince


On 1/3/2014 2:58 PM, Toon Moene wrote:
> On 01/03/2014 07:04 PM, Jakub Jelinek wrote:
>> On Fri, Jan 03, 2014 at 05:04:55PM +0100, Toon Moene wrote:
>>> I am trying to figure out how the top-consuming routines in our
>>> weather models will be compiled when using AVX512 instructions (and
>>> their 32 512 bit registers).
>
> what I'm interested in, is (cat verintlin.f):
>
> [...]
>
> i.e., real Fortran code, not just intrinsics :-)

Right out of the AVX512 architect's dream.  It appears to need 24 
AVX-512 registers in the ifort compilation (/arch:MIC-AVX512) to avoid 
those spills and repeated memory operands in the gfortran avx2 compilation.
How small a ratio of floating point to total instructions can you call 
"real Fortran?"


--
Tim Prince