Suboptimal code generated for __builtin_rint on AMD64 without SSE4.1

2021-08-05 Thread Stefan Kanthak
Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (12 instructions using 51 bytes, plus 4 quadwords
using 32 bytes) for __builtin_rint() when -msse4.1 is NOT given:

.text
   0:   f2 0f 10 15 10 00 00 00 movsd   .LC1(%rip), %xmm2
4: R_X86_64_PC32.rdata
   8:   f2 0f 10 1d 00 00 00 00 movsd   .LC0(%rip), %xmm3
c: R_X86_64_PC32.rdata
  10:   66 0f 28 c8 movapd  %xmm0, %xmm1
  14:   66 0f 54 ca andpd   %xmm2, %xmm1
  18:   66 0f 2f d9 comisd  %xmm1, %xmm3
  1c:   76 14   jbe 32 
  1e:   f2 0f 58 cb addsd   %xmm3, %xmm1
  22:   66 0f 55 d0 andnpd  %xmm0, %xmm2
  26:   f2 0f 5c cb subsd   %xmm3, %xmm1
  2a:   66 0f 56 ca orpd    %xmm2, %xmm1
  2e:   66 0f 28 c1 movapd  %xmm1, %xmm0
  32:   c3  retq

.rdata
.align 8
   0:   00 00 00 00 .LC0:   .quad  0x1.0p52
00 00 30 43
00 00 00 00
00 00 00 00
.align 16
  10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
ff ff ff 7f
  18:   00 00 00 00 .quad  0.0
00 00 00 00
.end

JFTR: in the best case, the memory accesses cost several cycles,
  while in the worst case they yield a page fault!
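
For reference, the listing above can be reproduced with a minimal
testcase along these lines (file name and flags are illustrative):

    /* rint.c -- compile with e.g.: gcc -O3 -mno-sse4.1 -S rint.c */
    double f (double x)
    {
        return __builtin_rint (x);
    }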


Properly optimized, faster and shorter code, using just 9 instructions
in only 33 bytes, WITHOUT superfluous constants, thus avoiding costly
memory accesses and saving at least 16 + 32 bytes, follows:

  .intel_syntax
  .text
   0:   f2 48 0f 2c c0  cvtsd2si rax, xmm0   # rax = llrint(argument)
   5:   48 f7 d8        neg      rax
#                       jz       .L0         # argument zero?
   8:   70 16           jo       .L0         # argument indefinite?
                                             # argument overflows 64-bit integer?
   a:   48 f7 d8        neg      rax
   d:   f2 48 0f 2a c8  cvtsi2sd xmm1, rax   # xmm1 = rint(argument)
  12:   66 0f 73 d0 3f  psrlq    xmm0, 63
  17:   66 0f 73 f0 3f  psllq    xmm0, 63    # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  1c:   66 0f 56 c1     orpd     xmm0, xmm1  # xmm0 = round(argument)
  20:   c3        .L0:  ret
  .end
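
In C terms, the proposed sequence corresponds roughly to the following
sketch (assuming x86 semantics, where a NaN or out-of-range conversion
yields the integer-indefinite value LLONG_MIN; that case is exactly
what the neg/jo pair detects):

    #include <limits.h>

    double rint_sketch (double x)
    {
        long long i = __builtin_llrint (x); /* cvtsd2si: round per current mode */
        if (i == LLONG_MIN)                 /* NaN or |x| >= 2**63: x is        */
            return x;                       /* already integral, return as-is   */
        double r = (double) i;              /* cvtsi2sd                         */
        if (r == 0.0)                       /* psrlq/psllq/orpd: preserve       */
            r = __builtin_copysign (r, x);  /* the sign of -0.0                 */
        return r;
    }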

regards
Stefan


Suboptimal code generated for __builtin_ceil on AMD64 without SSE4.1

2021-08-05 Thread Stefan Kanthak
Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (17 instructions using 78 bytes, plus 6 quadwords
using 48 bytes) for __builtin_ceil() when -msse4.1 is NOT given:

.text
   0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
4: R_X86_64_PC32.rdata
   8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
c: R_X86_64_PC32.rdata
  10:   66 0f 28 d8 movapd %xmm0, %xmm3
  14:   66 0f 28 c8 movapd %xmm0, %xmm1
  18:   66 0f 54 da andpd  %xmm2, %xmm3
  1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
  20:   76 2b   jbe 4d <_ceil+0x4d>
  22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
  27:   66 0f ef db pxor   %xmm3, %xmm3
  2b:   f2 0f 10 25 20 00 00 00 movsd  0x20(%rip), %xmm4
2f: R_X86_64_PC32   .rdata
  33:   66 0f 55 d1 andnpd %xmm1, %xmm2
  37:   f2 48 0f 2a d8  cvtsi2sd %rax, %xmm3
  3c:   f2 0f c2 c3 06  cmpnlesd %xmm3, %xmm0
  41:   66 0f 54 c4 andpd  %xmm4, %xmm0
  45:   f2 0f 58 c3 addsd  %xmm3, %xmm0
  49:   66 0f 56 c2 orpd   %xmm2, %xmm0
  4d:   c3  retq

.rdata
.align 8
   0:   00 00 00 00 .LC0:   .quad  0x1.0p52
00 00 30 43
00 00 00 00
00 00 00 00
.align 16
  10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
ff ff ff 7f
  18:   00 00 00 00 .quad  0.0
00 00 00 00
.align 8
  20:   00 00 00 00 .LC2:   .quad  0x1.0p0
00 00 f0 3f
00 00 00 00
00 00 00 00
.end

JFTR: in the best case, the memory accesses cost several cycles,
  while in the worst case they yield a page fault!


Properly optimized, faster and shorter code, using just 15 instructions
in 65 bytes, WITHOUT superfluous constants, thus avoiding costly memory
accesses and saving at least 32 bytes, follows:

  .intel_syntax
  .equ  BIAS, 1023
  .text
   0:   f2 48 0f 2c c0  cvttsd2si rax, xmm0  # rax = trunc(argument)
   5:   48 f7 d8        neg       rax
#                       jz        .L0        # argument zero?
   8:   70 36           jo        .L0        # argument indefinite?
                                             # argument overflows 64-bit integer?
   a:   48 f7 d8        neg       rax
   d:   f2 48 0f 2a c8  cvtsi2sd  xmm1, rax  # xmm1 = trunc(argument)
  12:   48 a1 00 00 00  mov       rax, BIAS << 52
  19:   00 00 00 f0 3f
  1c:   66 48 0f 6e d0  movq      xmm2, rax  # xmm2 = 0x1.0p0
  21:   f2 0f 10 d8     movsd     xmm3, xmm0 # xmm3 = argument
  25:   f2 0f c2 d9 02  cmplesd   xmm3, xmm1 # xmm3 = (argument <= trunc(argument)) ? ~0L : 0L
  2a:   66 0f 55 da     andnpd    xmm3, xmm2 # xmm3 = (argument <= trunc(argument)) ? 0.0 : 1.0
  2e:   f2 0f 58 d9     addsd     xmm3, xmm1 # xmm3 = (argument > trunc(argument)) ? 1.0 : 0.0
                                             #      + trunc(argument)
                                             #      = ceil(argument)
  32:   66 0f 73 d0 3f  psrlq     xmm0, 63
  37:   66 0f 73 f0 3f  psllq     xmm0, 63   # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  3c:   66 0f 56 c3     orpd      xmm0, xmm3 # xmm0 = ceil(argument)
  40:   c3        .L0:  ret
  .end
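
The same idea as a C sketch (the cmplesd/andnpd/addsd triple computes
the +1.0 adjustment branchlessly; it is written with branches here for
clarity, and LLONG_MIN again stands in for the integer-indefinite
result of the conversion, a case that is formally undefined in C):

    #include <limits.h>

    double ceil_sketch (double x)
    {
        long long i = (long long) x;        /* cvttsd2si: truncate          */
        if (i == LLONG_MIN)                 /* NaN or |x| >= 2**63          */
            return x;
        double t = (double) i;              /* cvtsi2sd                     */
        if (x > t)                          /* cmplesd/andnpd/addsd         */
            t += 1.0;
        if (t == 0.0)                       /* orpd: preserve -0.0          */
            t = __builtin_copysign (t, x);
        return t;
    }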

regards
Stefan


Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Stefan Kanthak
Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (13 instructions using 57 bytes, plus 4 quadwords
using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:

.text
   0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
4: R_X86_64_PC32.rdata
   8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
c: R_X86_64_PC32.rdata
  10:   66 0f 28 d8 movapd %xmm0, %xmm3
  14:   66 0f 28 c8 movapd %xmm0, %xmm1
  18:   66 0f 54 da andpd  %xmm2, %xmm3
  1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
  20:   76 16   jbe 38 <_trunc+0x38>
  22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
  27:   66 0f ef c0 pxor   %xmm0, %xmm0
  2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
  2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
  34:   66 0f 56 c2 orpd   %xmm2, %xmm0
  38:   c3  retq

.rdata
.align 8
   0:   00 00 00 00 .LC0:   .quad  0x1.0p52
00 00 30 43
00 00 00 00
00 00 00 00
.align 16
  10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
ff ff ff 7f
  18:   00 00 00 00 .quad  0.0
00 00 00 00
.end

JFTR: in the best case, the memory accesses cost several cycles,
  while in the worst case they yield a page fault!


Properly optimized, shorter and faster code, using only 9 instructions
in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
and saving at least 16 + 32 bytes, follows:

  .intel_syntax
  .text
   0:   f2 48 0f 2c c0  cvttsd2si rax, xmm0  # rax = trunc(argument)
   5:   48 f7 d8        neg       rax
#                       jz        .L0        # argument zero?
   8:   70 16           jo        .L0        # argument indefinite?
                                             # argument overflows 64-bit integer?
   a:   48 f7 d8        neg       rax
   d:   f2 48 0f 2a c8  cvtsi2sd  xmm1, rax  # xmm1 = trunc(argument)
  12:   66 0f 73 d0 3f  psrlq     xmm0, 63
  17:   66 0f 73 f0 3f  psllq     xmm0, 63   # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  1c:   66 0f 56 c1     orpd      xmm0, xmm1 # xmm0 = trunc(argument)
  20:   c3        .L0:  ret
  .end
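
In C, the same skeleton with a truncating conversion and no adjustment
step (a sketch; LLONG_MIN again models the integer-indefinite result):

    #include <limits.h>

    double trunc_sketch (double x)
    {
        long long i = (long long) x;        /* cvttsd2si                    */
        if (i == LLONG_MIN)                 /* NaN or |x| >= 2**63          */
            return x;
        double t = (double) i;              /* cvtsi2sd                     */
        if (t == 0.0)                       /* psrlq/psllq/orpd             */
            t = __builtin_copysign (t, x);
        return t;
    }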

regards
Stefan


Re: Suboptimal code generated for __builtin_ceil on AMD64 without SSE4.1

2021-08-05 Thread Hongtao Liu via Gcc
Could you file a bugzilla for that?
https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc

On Thu, Aug 5, 2021 at 3:34 PM Stefan Kanthak  wrote:
>
> Hi,
>
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (17 instructions using 78 bytes, plus 6 quadwords
> using 48 bytes) for __builtin_ceil() when -msse4.1 is NOT given:
>
> .text
>0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> 4: R_X86_64_PC32.rdata
>8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> c: R_X86_64_PC32.rdata
>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
>   18:   66 0f 54 da andpd  %xmm2, %xmm3
>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
>   20:   76 2b   jbe4d <_ceil+0x4d>
>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
>   27:   66 0f ef db pxor   %xmm3, %xmm3
>   2b:   f2 0f 10 25 20 00 00 00 movsd  0x20(%rip), %xmm4
> 2f: R_X86_64_PC32   .rdata
>   33:   66 0f 55 d1 andnpd %xmm1, %xmm2
>   37:   f2 48 0f 2a d8  cvtsi2sd %rax, %xmm3
>   3c:   f2 0f c2 c3 06  cmpnlesd %xmm3, %xmm0
>   41:   66 0f 54 c4 andpd  %xmm4, %xmm0
>   45:   f2 0f 58 c3 addsd  %xmm3, %xmm0
>   49:   66 0f 56 c2 orpd   %xmm2, %xmm0
>   4d:   c3  retq
>
> .rdata
> .align 8
>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> 00 00 30 43
> 00 00 00 00
> 00 00 00 00
> .align 16
>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> ff ff ff 7f
>   18:   00 00 00 00 .quad  0.0
> 00 00 00 00
> .align 8
>   20:   00 00 00 00 .LC2:   .quad  0x1.0p0
> 00 00 f0 3f
> 00 00 00 00
> 00 00 00 00
> .end
>
> JFTR: in the best case, the memory accesses cost several cycles,
>   while in the worst case they yield a page fault!
>
>
> Properly optimized, faster and shorter code, using just 15 instructions
> in 65 bytes, WITHOUT superfluous constants, thus avoiding costly memory
> accesses and saving at least 32 bytes, follows:
>
>   .intel_syntax
>   .equBIAS, 1023
>   .text
>0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = trunc(argument)
>5:   48 f7 d8  neg rax
> # jz  .L0  # argument zero?
>8:   70 36 jo  .L0  # argument indefinite?
># argument overflows 
> 64-bit integer?
>a:   48 f7 d8  neg rax
>d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>   12:   48 a1 00 00 00mov rax, BIAS << 52
>   19:   00 00 00 f0 3f
>   1c:   66 48 0f 6e d0movqxmm2, rax# xmm2 = 0x1.0p0
>   21:   f2 0f 10 d8   movsd   xmm3, xmm0   # xmm3 = argument
>   25:   f2 0f c2 d9 02cmplesd xmm3, xmm1   # xmm3 = (argument <= 
> trunc(argument)) ? ~0L : 0L
>   2a:   66 0f 55 da   andnpd  xmm3, xmm2   # xmm3 = (argument <= 
> trunc(argument)) ? 0.0 : 1.0
>   2e:   f2 0f 58 d9   addsd   xmm3, xmm1   # xmm3 = (argument > 
> trunc(argument)) ? 1.0 : 0.0
>#  + trunc(argument)
>#  = ceil(argument)
>   32:   66 0f 73 d0 3fpsrlq   xmm0, 63
>   37:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & -0.0) 
> ? -0.0 : 0.0
>   3c:   66 0f 56 c3   orpdxmm0, xmm3   # xmm0 = ceil(argument)
>   40:   c3  .L0:  ret
>   .end
>
> regards
> Stefan



-- 
BR,
Hongtao


Suboptimal code generated for __builtin_floor on AMD64 without SSE4.1

2021-08-05 Thread Stefan Kanthak
Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (19 instructions using 86 bytes, plus 6 quadwords
using 48 bytes) for __builtin_floor() when -msse4.1 is NOT given:

.text
   0:   f2 0f 10 15 10 00 00 00 movsd   .LC1(%rip), %xmm2
4: R_X86_64_PC32.rdata
   8:   f2 0f 10 25 00 00 00 00 movsd   .LC0(%rip), %xmm4
c: R_X86_64_PC32.rdata
  10:   66 0f 28 d8 movapd %xmm0, %xmm3
  14:   66 0f 28 c8 movapd %xmm0, %xmm1
  18:   66 0f 54 da andpd  %xmm2, %xmm3
  1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
  20:   76 33   jbe 55 <_floor+0x55>
  22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
  27:   66 0f ef db pxor   %xmm3, %xmm3
  2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
  2f:   f2 48 0f 2a d8  cvtsi2sd %rax, %xmm3
  34:   66 0f 28 e3 movapd %xmm3, %xmm4
  38:   f2 0f c2 e0 06  cmpnlesd %xmm0, %xmm4
  3d:   f2 0f 10 05 20 00 00 00 movsd  .LC2(%rip), %xmm0
91: R_X86_64_PC32   .rdata
  45:   66 0f 54 e0 andpd  %xmm0, %xmm4
  49:   f2 0f 5c dc subsd  %xmm4, %xmm3
  4d:   66 0f 28 c3 movapd %xmm3, %xmm0
  51:   66 0f 56 c2 orpd   %xmm2, %xmm0
  55:   c3  retq

.rdata
.align 8
   0:   00 00 00 00 .LC0:   .quad  0x1.0p52
00 00 30 43
00 00 00 00
00 00 00 00
.align 16
  10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
ff ff ff 7f
  18:   00 00 00 00 .quad  0.0
00 00 00 00
.align 8
  20:   00 00 00 00 .LC2:   .quad  0x1.0p0
00 00 f0 3f
00 00 00 00
00 00 00 00
.end

JFTR: in the best case, the memory accesses cost several cycles,
  while in the worst case they yield a page fault!


Properly optimized, shorter and faster code, using only 15 instructions
in just 65 bytes, WITHOUT superfluous constants, thus avoiding costly
memory accesses and saving at least 16 + 48 bytes, follows:

  .intel_syntax
  .equ  BIAS, 1023
  .text
   0:   f2 48 0f 2c c0  cvttsd2si rax, xmm0  # rax = trunc(argument)
   5:   48 f7 d8        neg       rax
#                       jz        .L0        # argument zero?
   8:   70 36           jo        .L0        # argument indefinite?
                                             # argument overflows 64-bit integer?
   a:   48 f7 d8        neg       rax
   d:   f2 48 0f 2a c8  cvtsi2sd  xmm1, rax  # xmm1 = trunc(argument)
  12:   48 a1 00 00 00  mov       rax, (1 << 63) | (BIAS << 52)
  19:   00 00 00 f0 bf
  1c:   66 48 0f 6e d0  movq      xmm2, rax  # xmm2 = -0x1.0p0
  21:   f2 0f 10 d8     movsd     xmm3, xmm0 # xmm3 = argument
  25:   f2 0f c2 d9 01  cmpltsd   xmm3, xmm1 # xmm3 = (argument < trunc(argument)) ? ~0L : 0L
  2a:   66 0f 54 da     andpd     xmm3, xmm2 # xmm3 = (argument < trunc(argument)) ? -1.0 : 0.0
  2e:   f2 0f 58 d9     addsd     xmm3, xmm1 # xmm3 = (argument < trunc(argument)) ? -1.0 : 0.0
                                             #      + trunc(argument)
                                             #      = floor(argument)
  32:   66 0f 73 d0 3f  psrlq     xmm0, 63
  37:   66 0f 73 f0 3f  psllq     xmm0, 63   # xmm0 = (argument & -0.0) ? -0.0 : 0.0
  3c:   66 0f 56 c3     orpd      xmm0, xmm3 # xmm0 = floor(argument)
  40:   c3        .L0:  ret
  .end
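
In C, the floor variant differs from the ceil sketch only in the
direction of the comparison and of the adjustment (a sketch; LLONG_MIN
models the integer-indefinite result):

    #include <limits.h>

    double floor_sketch (double x)
    {
        long long i = (long long) x;        /* cvttsd2si                    */
        if (i == LLONG_MIN)                 /* NaN or |x| >= 2**63          */
            return x;
        double t = (double) i;              /* cvtsi2sd                     */
        if (x < t)                          /* cmpltsd/andpd/addsd          */
            t -= 1.0;
        if (t == 0.0)                       /* preserve -0.0                */
            t = __builtin_copysign (t, x);
        return t;
    }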

regards
Stefan



Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-05 Thread Prathamesh Kulkarni via Gcc
On Wed, 4 Aug 2021 at 18:30, Richard Earnshaw
 wrote:
>
> On 04/08/2021 13:46, Segher Boessenkool wrote:
> > On Wed, Aug 04, 2021 at 05:20:58PM +0530, Prathamesh Kulkarni wrote:
> >> On Wed, 4 Aug 2021 at 15:49, Segher Boessenkool
> >>  wrote:
> >>> Both __builtin_constant_p and __is_constexpr will not work in your use
> >>> case (since a function argument is not a constant, let alone an ICE).
> >>> It only becomes a constant value later on.  The manual (for the former)
> >>> says:
> >>>   You may use this built-in function in either a macro or an inline
> >>>   function. However, if you use it in an inlined function and pass an
> >>>   argument of the function as the argument to the built-in, GCC never
> >>>   returns 1 when you call the inline function with a string constant or
> >>>   compound literal (see Compound Literals) and does not return 1 when you
> >>>   pass a constant numeric value to the inline function unless you specify
> >>>   the -O option.
> >> Indeed, that's why I was thinking if we should use an attribute to mark 
> >> param as
> >> a constant, so during type-checking the function call, the compiler
> >> can emit a diagnostic if the passed arg
> >> is not a constant.
> >
> > That will depend on the vagaries of what optimisations the compiler
> > managed to do :-(
> >
> >> Alternatively -- as you suggest, we could define a new builtin, say
> >> __builtin_ice(x) that returns true if 'x' is an ICE.
> >
> > (That is a terrible name, it's not clear at all to the reader, just
> > write it out?  It is fun if you know what it means, but infuriating
> > otherwise.)
> >
> >> And wrap the intrinsic inside a macro that would check if the arg is an 
> >> ICE ?
> >
> > That will work yeah.  Maybe not as elegant as you'd like, but not all
> > that bad, and it *works*.  Well, hopefully it does :-)
> >
> >> For eg:
> >>
> >> __extension__ extern __inline int32x2_t
> >> __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> >> vshl_n_s32_1 (int32x2_t __a, const int __b)
> >> {
> >>   return __builtin_neon_vshl_nv2si (__a, __b);
> >> }
> >>
> >> #define vshl_n_s32(__a, __b) \
> >> ({ typeof (__a) a = (__a); \
> >>_Static_assert (__builtin_constant_p ((__b)), #__b " is not an
> >> integer constant"); \
> >>vshl_n_s32_1 (a, (__b)); })
> >>
> >> void f(int32x2_t x, const int y)
> >> {
> >>   vshl_n_s32 (x, 2);
> >>   vshl_n_s32 (x, y);
> >>
> >>   int z = 1;
> >>   vshl_n_s32 (x, z);
> >> }
> >>
> >> With this, the compiler rejects vshl_n_s32 (x, y) and vshl_n_s32 (x,
> >> z) at all optimization levels since neither 'y' nor 'z' is an ICE.
> >
> > You used __builtin_constant_p though, which works differently, so the
> > test is not conclusive, might not show what you want to show.
> >
> >> Instead of __builtin_constant_p, we could use __builtin_ice.
> >> Would that be a reasonable approach ?
> >
> > I think it will work, yes.
> >
> >> But this changes the semantics of intrinsic from being an inline
> >> function to a macro, and I am not sure if that's a good idea.
> >
> > Well, what happens if you call the actual builtin directly, with some
> > non-constant parameter?  That just fails with a more cryptic error,
> > right?  So you can view this as some syntactic sugar to make these
> > intrinsics easier to use.
> >
> > Hrm I now remember a place I could have used this:
> >
> > #define mtspr(n, x) do { asm("mtspr %1,%0" : : "r"(x), "n"(n)); } while (0)
> > #define mfspr(n) ({ \
> >   u32 x; asm volatile("mfspr %0,%1" : "=r"(x) : "n"(n)); x; \
> > })
> >
> > It is quite similar to your builtin code really, and I did resort to
> > macros there, for similar reasons :-)
> >
> >
> > Segher
> >
>
> We don't want to have to resort to macros.  Not least because at some
> point we want to replace the content of arm_neon.h with a single #pragma
> directive to remove all the parsing of the header that's needed.  What's
> more, if we had a suitable pragma we'd stand a fighting chance of being
> able to extend support to other languages as well that don't use the
> pre-processor, such as Fortran or Ada (not that that is on the cards
> right now).
Hi,
IIUC, a more general issue here, is that the intrinsics require
special type-checking of arguments, beyond what is dictated by the
Standard.
An argument needing to be an ICE could be seen as one instance.

So perhaps there should be some mechanism to tell the FE to let the
target do additional checking for a particular function call, say by
explicitly marking it with an "intrinsic" attribute?  While
type-checking a call to a function marked with the "intrinsic"
attribute, the FE could invoke a target handler with the name of the
function and the corresponding arguments, and then leave any further
checking to the target.
For the vshl_n case, the target hook would check that the 2nd arg is
an integer constant within the permissible range.

I propose to do this only for intrinsics that need special checking
and can be entirely implemented with C extensions and won't 

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-05 Thread Richard Earnshaw via Gcc
On 04/08/2021 18:59, Segher Boessenkool wrote:
> On Wed, Aug 04, 2021 at 07:08:08PM +0200, Florian Weimer wrote:
>> * Segher Boessenkool:
>>
>>> On Wed, Aug 04, 2021 at 03:27:00PM +0100, Richard Earnshaw wrote:
>>>> On 04/08/2021 14:40, Segher Boessenkool wrote:
>>>>> On Wed, Aug 04, 2021 at 02:00:42PM +0100, Richard Earnshaw wrote:
>>>>>> We don't want to have to resort to macros.  Not least because at some
>>>>>> point we want to replace the content of arm_neon.h with a single #pragma
>>>>>> directive to remove all the parsing of the header that's needed.  What's
>>>>>> more, if we had a suitable pragma we'd stand a fighting chance of being
>>>>>> able to extend support to other languages as well that don't use the
>>>>>> pre-processor, such as Fortran or Ada (not that that is on the cards
>>>>>> right now).
>>>>>
>>>>> So how do you want to handle constants-that-are-not-yet-constant, say
>>>>> before inlining?  And how do you want to deal with those possibly not
>>>>> ever becoming constant, perhaps because you used a too low "n" in -On
>>>>> (but there are very many random other causes)?  And, what *is* a
>>>>> constant, anyway?  This is even more fuzzy if you consider those
>>>>> other languages as well.
>>>>>
>>>>> (Does skipping parsing of some trivial header save so much time?  Huh!)
>>>>
>>>> Trivial?  arm_neon.h is currently 20k lines of source.  What's more, it
>>>> has to support inline functions that might not be available when the
>>>> header is parsed, but might become available if the user subsequently
>>>> compiles a function with different attributes enabled.  It is very
>>>> definitely *NOT* trivial.
>>>
>>> Ha yes :-)  I just assumed without looking that it would be like other
>>> architectures' intrinsics headers.  Whoops.
>>
>> But isn't it?
>>
>> $ echo '#include <arm_neon.h>' | gcc -E - | wc -l
>> 41045
> 
> $ echo '#include <altivec.h>' | gcc -E - -maltivec | wc -l
> 9
> 
> Most of this file (774 lines) is #define's, which take essentially no
> time at all.  And none of the other archs I have looked at have big
> headers either!
> 
> 
> Segher
> 

arm_sve.h isn't large either, but that's because all it contains (other
than a couple of typedefs) is

#pragma GCC aarch64 "arm_sve.h"

:)

R.


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Gabriel Paubert
On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> Hi,
> 
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (13 instructions using 57 bytes, plus 4 quadwords
> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> 
> .text
>0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> 4: R_X86_64_PC32.rdata
>8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> c: R_X86_64_PC32.rdata
>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
>   18:   66 0f 54 da andpd  %xmm2, %xmm3
>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
>   20:   76 16   jbe38 <_trunc+0x38>
>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
>   27:   66 0f ef c0 pxor   %xmm0, %xmm0
>   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
>   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
>   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
>   38:   c3  retq
> 
> .rdata
> .align 8
>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> 00 00 30 43
> 00 00 00 00
> 00 00 00 00
> .align 16
>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> ff ff ff 7f
>   18:   00 00 00 00 .quad  0.0
> 00 00 00 00
> .end
> 
> JFTR: in the best case, the memory accesses cost several cycles,
>   while in the worst case they yield a page fault!
> 
> 
> Properly optimized, shorter and faster code, using only 9 instructions
> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> and saving at least 16 + 32 bytes, follows:
> 
>   .intel_syntax
>   .text
>0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = trunc(argument)
>5:   48 f7 d8  neg rax
> # jz  .L0  # argument zero?
>8:   70 16 jo  .L0  # argument indefinite?
># argument overflows 
> 64-bit integer?
>a:   48 f7 d8  neg rax
>d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>   12:   66 0f 73 d0 3fpsrlq   xmm0, 63
>   17:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & -0.0) 
> ? -0.0 : 0.0
>   1c:   66 0f 56 c1   orpdxmm0, xmm1   # xmm0 = trunc(argument)
>   20:   c3  .L0:  ret
>   .end

There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer.  So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.

The last part of your code then goes to take into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

Potentially generating spurious invalid operation and then carefully
taking into account the sign of zero does not seem very consistent.

Apart from this, in your code, after cvttsd2si I'd rather use:
mov rcx,rax # make a second copy to a scratch register
neg rcx
jo .L0
cvtsi2sd xmm1,rax

The reason is latency, in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

cvttsd2si-->neg-?->neg-->cvtsi2sd
  
where the ? means that the following instructions are speculated.  

With an auxiliary register there are two dependency chains:

cvttsd2si-?->cvtsi2sd
 |->mov->neg->jump

Actually some OoO cores just eliminate register copies using register
renaming mechanism. But even this is probably completely irrelevant in
this case where the latency is dominated by the two conversion
instructions.

Regards,
Gabriel



> 
> regards
> Stefan
 



Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Richard Biener via Gcc
On Thu, Aug 5, 2021 at 11:44 AM Gabriel Paubert  wrote:
>
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> > Hi,
> >
> > targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> > following code (13 instructions using 57 bytes, plus 4 quadwords
> > using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >
> > .text
> >0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> > 4: R_X86_64_PC32.rdata
> >8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> > c: R_X86_64_PC32.rdata
> >   10:   66 0f 28 d8 movapd %xmm0, %xmm3
> >   14:   66 0f 28 c8 movapd %xmm0, %xmm1
> >   18:   66 0f 54 da andpd  %xmm2, %xmm3
> >   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
> >   20:   76 16   jbe38 <_trunc+0x38>
> >   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
> >   27:   66 0f ef c0 pxor   %xmm0, %xmm0
> >   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
> >   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
> >   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
> >   38:   c3  retq
> >
> > .rdata
> > .align 8
> >0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> > 00 00 30 43
> > 00 00 00 00
> > 00 00 00 00
> > .align 16
> >   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> > ff ff ff 7f
> >   18:   00 00 00 00 .quad  0.0
> > 00 00 00 00
> > .end
> >
> > JFTR: in the best case, the memory accesses cost several cycles,
> >   while in the worst case they yield a page fault!
> >
> >
> > Properly optimized, shorter and faster code, using only 9 instructions
> > in just 33 bytes, WITHOUT any constants, thus avoiding costly memory 
> > accesses
> > and saving at least 16 + 32 bytes, follows:
> >
> >   .intel_syntax
> >   .text
> >0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = trunc(argument)
> >5:   48 f7 d8  neg rax
> > # jz  .L0  # argument zero?
> >8:   70 16 jo  .L0  # argument indefinite?
> ># argument overflows 
> > 64-bit integer?
> >a:   48 f7 d8  neg rax
> >d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >   12:   66 0f 73 d0 3fpsrlq   xmm0, 63
> >   17:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & 
> > -0.0) ? -0.0 : 0.0
> >   1c:   66 0f 56 c1   orpdxmm0, xmm1   # xmm0 = trunc(argument)
> >   20:   c3  .L0:  ret
> >   .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.  So
> using your code may require some option (-ffast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
>
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
>
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
>  |->mov->neg->jump
>
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.

Btw, the code to emit these sequences is in
gcc/config/i386/i386-expand.c:ix86_expand_trunc
and friends.

Richard.

> Regards,
> Gabriel
>
>
>
> >
> > regards
> > Stefan
>
>


Re: Function attribute to indicate a likely (or unlikely) return value

2021-08-05 Thread Martin Liška

On 7/25/21 7:33 PM, Dominique Pellé via Gcc wrote:
> Hi

Hello.

> I read https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
> but was left wondering: is there a way to annotate a function
> to indicate that a return value is likely (or unlikely)?

Interesting idea :) No, we don't support that right now.

> For example, let's say we have this function:
>
>    // Return OK (=0) in case of success (frequent case)
>    // or an error code != 0 in case of failure (rare case).
>    int do_something();
>
> If it's unlikely to fail, I wish I could declare the function like
> this (pseudo-code!):
>
>    int do_something() __likely_return(OK);
>
> So wherever it's used, the optimizer can optimize branch
> prediction and the instruction cache.  In other words, lines
> like this:
>
>    if (do_something() == OK)
>
> ...  would implicitly be similar to:
>
>    // LIKELY defined as __builtin_expect((x), 1).
>    if (LIKELY(do_something() == OK))
>
> The advantage of being able to annotate the declaration,
> is that we only need to annotate once in the header, and
> all uses of the function can benefit from the optimization
> without polluting/modifying all code where the function
> is called.

I see your point, seems like a good idea. The question is,
how much would it take to implement and what's the benefit
of the suggested hints.

> Another example: a function that would be unlikely to
> return NULL could be declared as:
>
>    void *foo() __unlikely_returns(NULL);

Note that modern CPUs have branch predictors and a condition
of 'if (ptr == 0)' can be guessed quite easily.

> This last example would be a bit similar to the
> __attribute__((malloc)) since I read about it in the doc:
>
>    In addition, the GCC predicts that a function with
>    the attribute returns non-null in most cases.
>
> Of course __attribute__((malloc)) gives other guarantees
> (return value cannot alias any other pointer) so it's not
> equivalent.

Note we have a special branch probability for malloc:
gcc/predict.def:54

> Would attribute __likely_return() and __unlikely_return()
> make sense?

Similarly, we have now:

/* Branch to basic block containing call marked by noreturn attribute.  */
DEF_PREDICTOR (PRED_NORETURN, "noreturn call", PROB_VERY_LIKELY,
               PRED_FLAG_FIRST_MATCH)

Thanks for the ideas,
Martin

> Is there already a way to achieve this which I missed in
> the doc?
>
> Regards
> Dominique

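
For reference, the per-call-site annotation that exists today looks
like the sketch below; the __likely_return()/__unlikely_return()
attributes discussed above are proposals, not implemented GCC features.

    /* Today's workaround: wrap each call site in __builtin_expect.  */
    #define LIKELY(x) __builtin_expect ((x), 1)

    extern int do_something (void);  /* returns 0 (OK) in the frequent case */

    void caller (void)
    {
      if (LIKELY (do_something () == 0))
        {
          /* hot path; GCC typically lays this out as the fall-through */
        }
    }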

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Stefan Kanthak
Gabriel Paubert  wrote:


> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>> 
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>> 
>> .text
>>0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>> 4: R_X86_64_PC32.rdata
>>8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>> c: R_X86_64_PC32.rdata
>>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
>>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
>>   18:   66 0f 54 da andpd  %xmm2, %xmm3
>>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
>>   20:   76 16   jbe38 <_trunc+0x38>
>>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
>>   27:   66 0f ef c0 pxor   %xmm0, %xmm0
>>   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
>>   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
>>   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
>>   38:   c3  retq
>> 
>> .rdata
>> .align 8
>>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
>> 00 00 30 43
>> 00 00 00 00
>> 00 00 00 00
>> .align 16
>>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
>> ff ff ff 7f
>>   18:   00 00 00 00 .quad  0.0
>> 00 00 00 00
>> .end
>> 
>> JFTR: in the best case, the memory accesses cost several cycles,
>>   while in the worst case they yield a page fault!
>> 
>> 
>> Properly optimized, shorter and faster code, using only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
>> and saving at least 16 + 32 bytes, follows:
>> 
>>   .intel_syntax
>>   .text
>>0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = trunc(argument)
>>5:   48 f7 d8  neg rax
>> # jz  .L0  # argument zero?
>>8:   70 16 jo  .L0  # argument indefinite?
>># argument overflows 
>> 64-bit integer?
>>a:   48 f7 d8  neg rax
>>d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>>   12:   66 0f 73 d0 3fpsrlq   xmm0, 63
>>   17:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & 
>> -0.0) ? -0.0 : 0.0
>>   1c:   66 0f 56 c1   orpdxmm0, xmm1   # xmm0 = trunc(argument)
>>   20:   c3  .L0:  ret
>>   .end
> 
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.

Right, I overlooked this fault. Thanks for pointing out.

> So using your code may require some option (-ffast-math comes to mind),
> or you need at least a check on the exponent before cvttsd2si.

The whole idea behind these implementations is to get rid of loading
floating-point constants to perform comparisons.

> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).

Preserving the sign of -0.0 is explicitly specified in the standard,
and is cheap, as shown in my code.

> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
> 
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax

I don't know how GCC generates the code for builtins, and what kind of
templates it uses: the second goal was to minimize register usage.

> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
> 
> With your patch:
> 
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>  
> where the ? means that the following instructions are speculated.  
> 
> With an auxiliary register there are two dependency chains:
> 
> cvttsd2si-?->cvtsi2sd
> |->mov->neg->jump

Correct; see above: I expect the template(s) for builtins to give
the register allocator some freedom to split code paths and resolve
dependency chains.

> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.

Right, the conversions dominate both the original and the code I posted.
It's easy to get rid of them, with still slightly shorter and faster
branchless code (17 instructions, 84 bytes, instead of 

Re: Optional machine prefix for programs in for -B dirs, match ing Clang

2021-08-05 Thread Michael Matz via Gcc
Hello,

On Wed, 4 Aug 2021, John Ericson wrote:

> On Wed, Aug 4, 2021, at 10:48 AM, Michael Matz wrote:
> > ... the 'as' and 'ld' executables should be simply found within the 
> > version and target specific GCC libexecsubdir, possibly by being symlinks 
> > to whatever you want.  That's at least how my crosss are configured and 
> > installed, without any --with-{as,ld} options.
> 
> Yes that does work, and that's probably the best option today. I'm just 
> a little wary of unprefixing things programmatically.

The libexecsubdir _is_ the prefix in above case :)

> For some context, this is NixOS where we assemble a ton of cross 
> compilers automatically and each package gets its own isolated many FHS. 
> For that reason I would like to eventually avoid the target-specific 
> subdirs entirely, as I have the separate package trees to disambiguate 
> things. Now, I know that exact same argument could also be used to say 
> target prefixing is also superfluous, but eventually things on the PATH 
> need to be disambiguated.

Sure, which is why (e.g.) cross binutils do install with an arch prefix 
into ${bindir}.  But as GCC has the capability to look into libexecsubdir 
for binaries as well (which quite surely should never be in $PATH on any 
system), I don't see the conflict.

> There is no requirement that the libexec things be named like the bin 
> things, but I sort of feel it's one less thing to remember and makes 
> debugging easier.

Well, the naming scheme of binaries in libexecsubdir reflects the scheme 
that the compilers are using: cc1, cc1plus etc.  Not 
aarch64-unknown-linux-cc1.

> I am sympathetic to the issue that if GCC accepts everything Clang does 
> and vice-versa, we'll Postel's-law ourselves ourselves over time into 
> madness as mistakes are accumulated rather than weeded out.

Right.  I supposed it wouldn't hurt to also look for "${targettriple}-as" 
in $PATH before looking for 'as' (in $PATH).  But I don't think we can (or 
should) switch off looking for 'as' in libexecsubdir.  I don't even see 
why that behaviour should depend on an option, it could just be added by 
default.

> I now have some patches for this change I suppose I could also submit.

Even better :)


Ciao,
Michael.


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Gabriel Ravier via Gcc



On 8/5/21 11:42 AM, Gabriel Paubert wrote:

On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:

Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (13 instructions using 57 bytes, plus 4 quadwords
using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:

 .text
0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
 4: R_X86_64_PC32.rdata
8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
 c: R_X86_64_PC32.rdata
   10:   66 0f 28 d8 movapd %xmm0, %xmm3
   14:   66 0f 28 c8 movapd %xmm0, %xmm1
   18:   66 0f 54 da andpd  %xmm2, %xmm3
   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
   20:   76 16   jbe38 <_trunc+0x38>
   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
   27:   66 0f ef c0 pxor   %xmm0, %xmm0
   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
   38:   c3  retq

 .rdata
 .align 8
0:   00 00 00 00 .LC0:   .quad  0x1.0p52
 00 00 30 43
 00 00 00 00
 00 00 00 00
 .align 16
   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
 ff ff ff 7f
   18:   00 00 00 00 .quad  0.0
 00 00 00 00
 .end

JFTR: in the best case, the memory accesses cost several cycles,
   while in the worst case they yield a page fault!


Properly optimized, shorter and faster code, using only 9 instructions
in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
and saving at least 16 + 32 bytes, follows:

   .intel_syntax
   .text
0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = trunc(argument)
5:   48 f7 d8  neg rax
 # jz  .L0  # argument zero?
8:   70 16 jo  .L0  # argument indefinite?
# argument overflows 64-bit 
integer?
a:   48 f7 d8  neg rax
d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
   12:   66 0f 73 d0 3fpsrlq   xmm0, 63
   17:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & -0.0) 
? -0.0 : 0.0
   1c:   66 0f 56 c1   orpdxmm0, xmm1   # xmm0 = trunc(argument)
   20:   c3  .L0:  ret
   .end

There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer.  So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.

The last part of your code then goes to take into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

`-fno-signed-zeros` does that, if you need it


Potentially generating spurious invalid operation and then carefully
taking into account the sign of zero does not seem very consistent.

Apart from this, in your code, after cvttsd2si I'd rather use:
mov rcx,rax # make a second copy to a scratch register
neg rcx
jo .L0
cvtsi2sd xmm1,rax

The reason is latency, in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

cvttsd2si-->neg-?->neg-->cvtsi2sd
   
where the ? means that the following instructions are speculated.


With an auxiliary register there are two dependency chains:

cvttsd2si-?->cvtsi2sd
  |->mov->neg->jump

Actually some OoO cores just eliminate register copies using register
renaming mechanism. But even this is probably completely irrelevant in
this case where the latency is dominated by the two conversion
instructions.

Regards,
Gabriel




regards
Stefan
  


--
_
Gabriel RAVIER
First year student at Epitech
+33 6 36 46 16 43
gabriel.rav...@epitech.eu
11 Quai Finkwiller
67000 STRASBOURG



Question about finding parameters in function bodies from SSA variables

2021-08-05 Thread Erick Ochoa via Gcc
Hello Richard,

I'm still working on the points-to analysis and I am happy to say that
after reviewing the ipa-cp code I was able to generate summaries for
local variables, ssa variables, heap variables, global variables and
functions. I am also using the callback hooks to find out if
cgraph_nodes and varpool_nodes are added or deleted between
read_summaries and execute. Even though I don't update the solutions
between execute and function_transform yet, I am reading the points-to
pairs and remapping the constraint variables back to trees during
function_transform and printing the name of pointer-pointee pairs.

This is still very much a work in progress and a very weak points-to
analysis. I have almost finished my Andersen's / field insensitive /
context insensitive / flow-insensitive / intraprocedural analysis with
the LTO framework (without interacting with other transformations
yet). The only thing that I am missing is assigning parameters to be
pointing to NONLOCAL memory upon entry to the function and perhaps
some corner cases where gimple is not exactly how I expect it to be.

I am wondering, none of the variables in
function->gimple_df->ssa_names and function->local_decls are
PARM_DECL. I'm also not entirely sure if I should be looking for
PARM_DECLs since looking at function bodies' gimple representation I
don't see the formal parameters being used inside the function.
Instead, it appears that some SSA variables are automatically
initialized with the parameter value. Is this the case?

For example, for a function:

foo (struct a* $NAME)

The variable $NAME is nowhere used inside the function. I also found
that there is an ssa variable in location X ( in
function->gimple_df->ssa_names[X]) named with a variation like
$NAME_$X(D) and this seems to correspond to the parameter $NAME. How
can one (preferably looking only at
function->gimple_df->ssa_names[$X]) find out that this tree
corresponds to a parameter?
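
(One way to check this with GCC's tree API, sketched and untested: the
incoming value of a parameter is the "default definition" SSA name of
its PARM_DECL, which is what the "(D)" suffix in the dumps marks.)

    /* 'ssa' is one of the trees from function->gimple_df->ssa_names.  */
    tree var = SSA_NAME_VAR (ssa);
    if (var != NULL_TREE
        && TREE_CODE (var) == PARM_DECL
        && SSA_NAME_IS_DEFAULT_DEF (ssa))
      {
        /* 'ssa' holds the value of parameter 'var' on function entry,
           e.g. the $NAME_$X(D) names described above.  */
      }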

Many thanks!
-Erick


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Gabriel Paubert
Hi,

On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert  wrote:
> 
> 
> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> >> Hi,
> >> 
> >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> >> following code (13 instructions using 57 bytes, plus 4 quadwords
> >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >> 
> >> .text
> >>0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> >> 4: R_X86_64_PC32.rdata
> >>8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> >> c: R_X86_64_PC32.rdata
> >>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
> >>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
> >>   18:   66 0f 54 da andpd  %xmm2, %xmm3
> >>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
> >>   20:   76 16   jbe38 <_trunc+0x38>
> >>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
> >>   27:   66 0f ef c0 pxor   %xmm0, %xmm0
> >>   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
> >>   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
> >>   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
> >>   38:   c3  retq
> >> 
> >> .rdata
> >> .align 8
> >>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> >> 00 00 30 43
> >> 00 00 00 00
> >> 00 00 00 00
> >> .align 16
> >>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> >> ff ff ff 7f
> >>   18:   00 00 00 00 .quad  0.0
> >> 00 00 00 00
> >> .end
> >> 
> >> JFTR: in the best case, the memory accesses cost several cycles,
> >>   while in the worst case they yield a page fault!
> >> 
> >> 
> >> Properly optimized, shorter and faster code, using only 9 instructions
> >> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory 
> >> accesses
> >> and saving at least 16 + 32 bytes, follows:
> >> 
> >>   .intel_syntax
> >>   .text
> >>0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = trunc(argument)
> >>5:   48 f7 d8  neg rax
> >> # jz  .L0  # argument zero?
> >>8:   70 16 jo  .L0  # argument indefinite?
> >># argument overflows 
> >> 64-bit integer?
> >>a:   48 f7 d8  neg rax
> >>d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >>   12:   66 0f 73 d0 3fpsrlq   xmm0, 63
> >>   17:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & 
> >> -0.0) ? -0.0 : 0.0
> >>   1c:   66 0f 56 c1   orpdxmm0, xmm1   # xmm0 = trunc(argument)
> >>   20:   c3  .L0:  ret
> >>   .end
> > 
> > There is one important difference, namely setting the invalid exception
> > flag when the parameter can't be represented in a signed integer.
> 
> Right, I overlooked this fault. Thanks for pointing out.
> 
> > So using your code may require some option (-ffast-math comes to mind),
> > or you need at least a check on the exponent before cvttsd2si.
> 
> The whole idea behind these implementations is to get rid of loading
> floating-point constants to perform comparisons.

Indeed, but what I had in mind was something along the following lines:

movq rax,xmm0   # and copy rax to say rcx, if needed later
shrq rax,52 # move sign and exponent to 12 LSBs 
andl eax,0x7ff  # mask the sign
cmpl eax,0x434  # value to be checked
ja return   # exponent too large, we're done (what about NaNs?)
cvttsd2si rax,xmm0 # safe after exponent check
cvtsi2sd xmm0,rax  # conversion done

and a bit more to handle the corner cases (essentially preserve the
sign to be correct between -1 and -0.0). But the CPU can (speculatively) 
start the conversions early, so the dependency chain is rather short.

I don't know if it's faster than your new code, I'm almost sure that
it's shorter. Your new code also has a fairly long dependency chain.

> 
> > The last part of your code then goes to take into account the special
> > case of -0.0, which I most often don't care about (I'd like to have a
> > -fdont-split-hairs-about-the-sign-of-zero option).
> 
> Preserving the sign of -0.0 is explicitly specified in the standard,
> and is cheap, as shown in my code.
> 
> > Potentially generating spurious invalid operation and then carefully
> > taking into account the sign of zero does not seem very consistent.
> > 
> > Apart from this, in your code, after cvttsd2si I'd rather use:
> > mov rcx,rax # make a second copy to a scratch register
> > neg rcx
> > jo .L0
> > cvtsi2sd xmm1,r

Re: Noob question about simple customization of GCC.

2021-08-05 Thread David Malcolm via Gcc
On Wed, 2021-08-04 at 00:17 -0700, Alacaster Soi via Gcc wrote:
> How hard would it be to add a tree-like structure and
> headers/sections to
> the -v gcc option so you can see the call structure. Would this be a
> reasonable first contribution/customization for a noob? It'll be a
> while
> before I can reasonably work on this.
> GCC
> version
> config
> >  cc1 main.c
>   | cc1 config and
>   | output
> -> tempfile.s
>     '*extra space' *between each
> lowest
> level command
> >  as -v
>   | output
> -> tempfile.o
> 
> >  collect2.exe
>   | output
>   |- ld.exe
>  | output
> -> tempfile.exe
> 

I really like this UI idea, but I don't know how easy/hard it would be
to implement.  The code that implements figuring out what to invoke
(the "driver") is in gcc/gcc.c, which is a big source file.

FWIW there's also code in gcc/tree-diagnostic-path.cc to emit ASCII art
that does something a bit similar to your idea, which might be worth
looking at (in this case, to visualize function calls and returns along
a code path).

Hope this is helpful
Dave



Re: daily report on extending static analyzer project [GSoC]

2021-08-05 Thread Ankur Saini via Gcc



> On 05-Aug-2021, at 4:56 AM, David Malcolm  wrote:
> 
> On Wed, 2021-08-04 at 21:32 +0530, Ankur Saini wrote:
> 
> [...snip...]
>> 
>> - From observation, a typical vfunc call that isn't devirtualised by
>> the compiler's front end looks something like this 
>> "OBJ_TYPE_REF(_2;(struct A)a_ptr_5(D)->0) (a_ptr_5(D))"
>> where "a_ptr_5(D)" is pointer that is being used to call the virtual
>> function.
>> 
>> - We can access it's region to see what is the type of the object the
>> pointer is actually pointing to.
>> 
>> - This is then used to find a call with DECL_CONTEXT of the object
>> from all the possible targets of that polymorphic call.
> 
> [...]
> 
>> 
>> Patch file ( prototype ) : 
>> 
> 
>> +  /* Call is possibly a polymorphic call.
>> +  
>> + In such case, use devirtualisation tools to find 
>> + possible callees of this function call.  */
>> +  
>> +  function *fun = get_current_function ();
>> +  gcall *stmt  = const_cast (call);
>> +  cgraph_edge *e = cgraph_node::get (fun->decl)->get_edge (stmt);
>> +  if (e->indirect_info->polymorphic)
>> +  {
>> +void *cache_token;
>> +bool final;
>> +vec  targets
>> +  = possible_polymorphic_call_targets (e, &final, &cache_token, true);
>> +if (!targets.is_empty ())
>> +  {
>> +tree most_probable_target = NULL_TREE;
>> +if(targets.length () == 1)
>> +return targets[0]->decl;
>> +
>> +/* From the current state, check which subclass the pointer that 
>> +   is being used to this polymorphic call points to, and use to
>> +   filter out correct function call.  */
>> +tree t_val = gimple_call_arg (call, 0);
> 
> Maybe rename to "this_expr"?
> 
> 
>> +const svalue *sval = get_rvalue (t_val, ctxt);
> 
> and "this_sval"?

ok

> 
> ...assuming that that's what the value is.
> 
> Probably should reject the case where there are zero arguments.

Ideally it should always have one argument representing the pointer used to 
call the function. 

for example, if the function is called like this : -

a_ptr->foo(arg);  // where foo() is a virtual function and a_ptr is a pointer 
to an object of a subclass.

I saw that it’s GIMPLE representation is as follows : -

OBJ_TYPE_REF(_2;(struct A)a_ptr_5(D)->0) (a_ptr_5, arg);

> 
> 
>> +
>> +const region *reg
>> +  = [&]()->const region *
>> +  {
>> +switch (sval->get_kind ())
>> +  {
>> +case SK_INITIAL:
>> +  {
>> +const initial_svalue *initial_sval
>> +  = sval->dyn_cast_initial_svalue ();
>> +return initial_sval->get_region ();
>> +  }
>> +  break;
>> +case SK_REGION:
>> +  {
>> +const region_svalue *region_sval 
>> +  = sval->dyn_cast_region_svalue ();
>> +return region_sval->get_pointee ();
>> +  }
>> +  break;
>> +
>> +default:
>> +  return NULL;
>> +  }
>> +  } ();
> 
> I think the above should probably be a subroutine.
> 
> That said, it's not clear to me what it's doing, or that this is correct.


Sorry, I think I should have explained it earlier.

Let's take an example code snippet :- 

Derived d;
Base *base_ptr;
base_ptr = &d;
base_ptr->foo();// where foo() is a virtual function

This generates the following GIMPLE dump :- 

Derived::Derived (&d);
base_ptr_6 = &d.D.3779;
_1 = base_ptr_6->_vptr.Base;
_2 = _1 + 8;
_3 = *_2;
OBJ_TYPE_REF(_3;(struct Base)base_ptr_6->1) (base_ptr_6);

Here, instead of trying to extract the virtual pointer from the call and see 
which subclass it belongs to, I found it simpler to extract the pointer that is 
used to make the call itself (which, from observation, is always the first 
parameter of the call) and to use the region model at that point to figure out 
the type of the object it actually points to, and thus the subclass whose 
function is being called here. :)

Now let me try to explain how I actually executed it ( A lot of assumptions 
here are based on observation, so please correct me wherever you think I made a 
false interpretation or forgot about a certain special case ) :

- once it is confirmed that the call that we are dealing with is a polymorphic 
call ( via the cgraph edge representing the call ), I used the 
"possible_polymorphic_call_targets ()" from ipa-utils.h ( defined in 
ipa-devirt.c ), to get the possible callees of that call. 

  function *fun = get_current_function ();
  gcall *stmt  = const_cast (call);
  cgraph_edge *e = cgraph_node::get (fun->decl)->get_edge (stmt);
  if (e->indirect_info->polymorphic)
  {
void *cache_token;
bool final;
vec  targets
  = possible_polymorphic_call_targ

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-05 Thread Segher Boessenkool
On Thu, Aug 05, 2021 at 02:31:02PM +0530, Prathamesh Kulkarni wrote:
> On Wed, 4 Aug 2021 at 18:30, Richard Earnshaw
>  wrote:
> > We don't want to have to resort to macros.  Not least because at some
> > point we want to replace the content of arm_neon.h with a single #pragma
> > directive to remove all the parsing of the header that's needed.  What's
> > more, if we had a suitable pragma we'd stand a fighting chance of being
> > able to extend support to other languages as well that don't use the
> > pre-processor, such as Fortran or Ada (not that that is on the cards
> > right now).
> Hi,
> IIUC, a more general issue here, is that the intrinsics require
> special type-checking of arguments, beyond what is dictated by the
> Standard.
> An argument needing to be an ICE could be seen as one instance.
> 
> So perhaps, should there be some mechanism to tell the FE to let the
> target do additional checking for a particular function call, say by

An integer constant expression can be checked by the frontend itself; it
does not depend on optimisation etc.  That is the beauty of it: it is a)
more local, and b) a more reliable / less surprising thing to use.

But it *is* less powerful than "it is a constant integer after a travel
through the bowels of the compiler".  Which of course is less reliable
and more surprising (think what happens if you use -O0 or -O1 or -Og or
-Os or any -fno- etc.)  So it will be a lot more maintenance work
(answering PRs about it is only the start).
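
A small example of that surprise (per the manual text quoted earlier in
the thread; the exact result can vary, but typically it is 0 at -O0 and
1 once the function is inlined with -O enabled):

    static inline int is_const (int x)
    {
      return __builtin_constant_p (x);
    }

    int probe (void)
    {
      return is_const (42);  /* typically 0 at -O0, 1 at -O1 and above */
    }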


Segher


gcc-9-20210805 is now available

2021-08-05 Thread GCC Administrator via Gcc
Snapshot gcc-9-20210805 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/9-20210805/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 9 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch releases/gcc-9 
revision 11e2ac8f75060d9be432e8db1f358298a75c98d4

You'll find:

 gcc-9-20210805.tar.xzComplete GCC

  SHA256=4ee185d8c6144cebf81cd01ab68c8d64f8b097765f2278ec00882368e9dcfbcc
  SHA1=39fe1b99542d66d02d17131a7f297958439bc2ed

Diffs from 9-20210729 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-9
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: daily report on extending static analyzer project [GSoC]

2021-08-05 Thread David Malcolm via Gcc
On Thu, 2021-08-05 at 20:27 +0530, Ankur Saini wrote:
> 
> 
> > On 05-Aug-2021, at 4:56 AM, David Malcolm 
> > wrote:
> > 
> > On Wed, 2021-08-04 at 21:32 +0530, Ankur Saini wrote:
> > 
> > [...snip...]
> > > 
> > > - From observation, a typical vfunc call that isn't devirtualised
> > > by
> > > the compiler's front end looks something like this 
> > > "OBJ_TYPE_REF(_2;(struct A)a_ptr_5(D)->0) (a_ptr_5(D))"
> > > where "a_ptr_5(D)" is pointer that is being used to call the
> > > virtual
> > > function.
> > > 
> > > - We can access it's region to see what is the type of the object
> > > the
> > > pointer is actually pointing to.
> > > 
> > > - This is then used to find a call with DECL_CONTEXT of the object
> > > from all the possible targets of that polymorphic call.
> > 
> > [...]
> > 
> > > 
> > > Patch file ( prototype ) : 
> > > 
> > 
> > > +  /* Call is possibly a polymorphic call.
> > > +  
> > > + In such case, use devirtualisation tools to find 
> > > + possible callees of this function call.  */
> > > +  
> > > +  function *fun = get_current_function ();
> > > +  gcall *stmt  = const_cast (call);
> > > +  cgraph_edge *e = cgraph_node::get (fun->decl)->get_edge (stmt);
> > > +  if (e->indirect_info->polymorphic)
> > > +  {
> > > +    void *cache_token;
> > > +    bool final;
> > > +    vec  targets
> > > +  = possible_polymorphic_call_targets (e, &final,
> > > &cache_token, true);
> > > +    if (!targets.is_empty ())
> > > +  {
> > > +    tree most_probable_target = NULL_TREE;
> > > +    if(targets.length () == 1)
> > > +   return targets[0]->decl;
> > > +    
> > > +    /* From the current state, check which subclass the
> > > pointer that 
> > > +   is being used to this polymorphic call points to, and
> > > use to
> > > +   filter out correct function call.  */
> > > +    tree t_val = gimple_call_arg (call, 0);
> > 
> > Maybe rename to "this_expr"?
> > 
> > 
> > > +    const svalue *sval = get_rvalue (t_val, ctxt);
> > 
> > and "this_sval"?
> 
> ok
> 
> > 
> > ...assuming that that's what the value is.
> > 
> > Probably should reject the case where there are zero arguments.
> 
> Ideally it should always have one argument representing the pointer
> used to call the function. 
> 
> for example, if the function is called like this:
> 
> a_ptr->foo(arg);  // where foo() is a virtual function and a_ptr is a
> pointer to an object of a subclass.
> 
> I saw that its GIMPLE representation is as follows:
> 
> OBJ_TYPE_REF(_2;(struct A)a_ptr_5(D)->0) (a_ptr_5, arg);
> 
> > 
> > 
> > > +
> > > +    const region *reg
> > > +  = [&]()->const region *
> > > +  {
> > > +    switch (sval->get_kind ())
> > > +  {
> > > +    case SK_INITIAL:
> > > +  {
> > > +    const initial_svalue *initial_sval
> > > +  = sval->dyn_cast_initial_svalue ();
> > > +    return initial_sval->get_region ();
> > > +  }
> > > +  break;
> > > +    case SK_REGION:
> > > +  {
> > > +    const region_svalue *region_sval 
> > > +  = sval->dyn_cast_region_svalue ();
> > > +    return region_sval->get_pointee ();
> > > +  }
> > > +  break;
> > > +
> > > +    default:
> > > +  return NULL;
> > > +  }
> > > +  } ();
> > 
> > I think the above should probably be a subroutine.
> > 
> > That said, it's not clear to me what it's doing, or that this is
> > correct.
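One possible shape for such a subroutine, reusing only the calls already
present in the quoted patch (the function name is an assumption):

  /* Sketch: return the region pointed to / referenced by SVAL,
     or NULL if the svalue kind is not handled.  */
  static const region *
  maybe_get_region_for_sval (const svalue *sval)
  {
    switch (sval->get_kind ())
      {
      case SK_INITIAL:
        return sval->dyn_cast_initial_svalue ()->get_region ();
      case SK_REGION:
        return sval->dyn_cast_region_svalue ()->get_pointee ();
      default:
        return NULL;
      }
  }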
> 
> 
> Sorry, I think I should have explained it earlier.
> 
> Let's take an example code snippet:
> 
> Derived d;
> Base *base_ptr;
> base_ptr = &d;
> base_ptr->foo();// where foo() is a virtual function
> 
> This generates the following GIMPLE dump:
> 
> Derived::Derived (&d);
> base_ptr_6 = &d.D.3779;
> _1 = base_ptr_6->_vptr.Base;
> _2 = _1 + 8;
> _3 = *_2;
> OBJ_TYPE_REF(_3;(struct Base)base_ptr_6->1) (base_ptr_6);
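A self-contained variant of that snippet (function bodies added only so it
compiles stand-alone; the vtable slot in the dump depends on the full class
definitions, so it may differ):

  struct Base
  {
    virtual void foo () {}
  };

  struct Derived : Base
  {
    void foo () override {}
  };

  void test ()
  {
    Derived d;
    Base *base_ptr = &d;
    base_ptr->foo ();   /* the OBJ_TYPE_REF call in the dump above */
  }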

I did a bit of playing with this example, and tried adding:

    case OBJ_TYPE_REF:
      gcc_unreachable ();
      break;

to region_model::get_rvalue_1, and running cc1plus under the debugger.

The debugger hits the "gcc_unreachable ();", at this stmt:

 OBJ_TYPE_REF(_2;(struct Base)base_ptr_5->0) (base_ptr_5);

Looking at the region_model with region_model::debug() shows:

(gdb) call debug()
stack depth: 1
  frame (index 0): frame: ‘test’@1
clusters within frame: ‘test’@1
  cluster for: Derived d
key:   {bytes 0-7}
value: ‘int (*) () *’ {(&constexpr int (* Derived::_ZTV7Derived [3])(...)+(sizetype)16)}
  cluster for: base_ptr_5: &Derived d.
  cluster for: _2: &‘foo’
m_called_unknown_fn: FALSE
constraint_manager:
  equiv classes:
ec0: {&Derived d.}
ec1: {&

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-05 Thread Martin Sebor via Gcc

On 8/4/21 3:46 AM, Richard Earnshaw wrote:



On 03/08/2021 18:44, Martin Sebor wrote:

On 8/3/21 4:11 AM, Prathamesh Kulkarni via Gcc wrote:
On Tue, 27 Jul 2021 at 13:49, Richard Biener 
 wrote:


On Mon, Jul 26, 2021 at 11:06 AM Prathamesh Kulkarni via Gcc
 wrote:


On Fri, 23 Jul 2021 at 23:29, Andrew Pinski  wrote:


On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc
 wrote:


Hi,
Continuing from this thread,
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
The proposal is to provide a mechanism to mark a parameter in a
function as a literal constant.

Motivation:
Consider the following intrinsic vshl_n_s32 from arm/arm_neon.h:

__extension__ extern __inline int32x2_t
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vshl_n_s32 (int32x2_t __a, const int __b)
{
   return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
}

and its caller:

int32x2_t f (int32x2_t x)
{
    return vshl_n_s32 (x, 1);
}
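
For contrast, the kind of call the checking machinery has to reject, since
the shift count is not an immediate (assumes an ARM target with arm_neon.h;
the diagnostic quoted further below is today's expander error):

  #include <arm_neon.h>

  int32x2_t g (int32x2_t x, int n)
  {
    return vshl_n_s32 (x, n);   /* n is a runtime value: must be diagnosed */
  }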


Can't you do similar to what is done already in the aarch64 back-end:
#define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0]))
#define __AARCH64_LANE_CHECK(__vec, __idx) \
  __builtin_aarch64_im_lane_boundsi (sizeof (__vec), sizeof (__vec[0]), __idx)

?
Yes this is about lanes but you could even add one for min/max which
is generic and such; add an argument to say the intrinsics name even.
You could do this as a non-target builtin if you want and reuse it
also for the aarch64 backend.
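
A sketch of what such a generic min/max check might look like, modelled on
the aarch64 macro above (the builtin name here is hypothetical, nothing with
this name exists in GCC; __func__ stands in for "say the intrinsic's name"):

  #define __ARM_IMM_RANGE_CHECK(__val, __min, __max) \
    /* hypothetical generic builtin, per the suggestion above */ \
    __builtin_im_range_check ((__val), (__min), (__max), __func__)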

Hi Andrew,
Thanks for the suggestions. IIUC, we could use this approach to check
whether the argument falls within a certain range (min / max), but I am
not sure how it will help to determine whether the arg is a constant
immediate. AFAIK, the vshl_n intrinsics require that the 2nd arg be an
immediate.

Even the current RTL builtin checking is not consistent across
optimization levels:
For example:
int32x2_t f(int32_t *restrict a)
{
   int32x2_t v = vld1_s32 (a);
   int b = 2;
   return vshl_n_s32 (v, b);
}

With pristine trunk, compiling with -O2 results in no errors because
constant propagation replaces 'b' with 2, and during expansion,
expand_builtin_args is happy. But at -O0, it results in the error -
"argument 2 must be a constant immediate".

So I guess we need some mechanism to mark a parameter as a constant ?


I guess you want to mark it in a way that the frontend should force
constant evaluation and raise an error if that's not possible?  C++
doesn't allow declaring a parameter as 'constexpr', but something like

void foo (consteval int i);

since I guess you do want to allow passing constexpr arguments
in C++ or in C extended forms of constants like

static const int a[4];

foo (a[1]);

?  But yes, this looks useful to me.

Hi Richard,
Thanks for the suggestions and sorry for late response.
I have attached a prototype patch that implements the consteval
attribute.  As implemented, the attribute takes one or more arguments,
which refer to parameter positions, and the corresponding parameters
must be const-qualified; failing that, the attribute is ignored.
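
So, as described, a declaration using the proposed attribute would look
something like this (a sketch based on the description above, not the
patch itself; the 2 names the position of the const parameter):

  __extension__ extern __inline int32x2_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__,
                  consteval (2)))
  vshl_n_s32 (int32x2_t __a, const int __b);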


I'm curious why the argument must be const-qualified.  If it's
to keep it from being changed in ways that would prevent it from
being evaluated at compile-time in the body of the function then
to be effective, the enforcement of the constraint should be on
the definition of the function.  Otherwise, the const qualifier
could be used in a declaration of a function but left out of
a subsequent definition, letting the definition modify the parameter,
like so:

   __attribute__ ((consteval (1))) void f (const int);

   inline __attribute__ ((always_inline)) void f (int i) { ++i; }


In this particular case it's because the inline function is implementing 
an intrinsic operation in the architecture and the instruction only 
supports a literal constant value.  At present we catch this while 
trying to expand the intrinsic, but that can lead to poor diagnostics 
because we really want to report against the line of code calling the 
intrinsic.


Presumably the intrinsics can accept (or can be made to accept) any
constant integer expressions, not just literals.  E.g., the aarch64
builtin below accepts them.  For example, this is accepted in C++:

  __Int64x2_t f (__Int32x2_t a)
  {
    constexpr int n = 2;
    return __builtin_aarch64_vshll_nv2si (a, n + 1);
  }

Making the intrinsics accept constant arguments in constexpr-like
functions and introducing a constexpr-lite attribute (for C code)
was what I was suggesting by the constexpr comment below.  I'd find
that a much more general and more powerful design.

But my comment above was to highlight that if requiring the function
argument referenced by the proposed consteval attribute to be const
is necessary to prevent it from being modified then the requirement
needs to be enforced not on the declaration but on the definition.

You may rightly say: "but we get to define the inline arm function
wrappers so we'll make sure to never declare them that way."  I don't
have a problem with that.  What I am s

Re: Optional machine prefix for programs in -B dirs, matching Clang

2021-08-05 Thread John Ericson



On Thu, Aug 5, 2021, at 8:30 AM, Michael Matz wrote:
> Hello,
> 
> On Wed, 4 Aug 2021, John Ericson wrote:
> 
> > On Wed, Aug 4, 2021, at 10:48 AM, Michael Matz wrote:
> > > ... the 'as' and 'ld' executables should be simply found within the 
> > > version and target specific GCC libexecsubdir, possibly by being symlinks 
> > > to whatever you want.  That's at least how my crosses are configured and 
> > > installed, without any --with-{as,ld} options.
> > 
> > Yes that does work, and that's probably the best option today. I'm just 
> > a little wary of unprefixing things programmatically.
> 
> The libexecsubdir _is_ the prefix in the above case :)

Right. I meant stripping off the (conventional) `cpu-vendor-os-` prefix that
ld and as carry; stripping off leading directories is easier.

> > For some context, this is NixOS, where we assemble a ton of cross 
> > compilers automatically and each package gets its own isolated mini FHS. 
> > For that reason I would like to eventually avoid the target-specific 
> > subdirs entirely, as I have the separate package trees to disambiguate 
> > things. Now, I know the exact same argument could also be used to say 
> > target prefixing is also superfluous, but eventually things on the PATH 
> > need to be disambiguated.
> 
> Sure, which is why (e.g.) cross binutils do install with an arch prefix 
> into ${bindir}.  But as GCC has the capability to look into libexecsubdir 
> for binaries as well (which quite surely should never be in $PATH on any 
> system), I don't see the conflict.

Yes, there is no actual conflict. Our original wrapper scripts may have been 
confused about this at some point, but that's on us.

> 
> > There is no requirement that the libexec things be named like the bin 
> > things, but I sort of feel it's one less thing to remember and makes 
> > debugging easier.
> 
> Well, the naming scheme of binaries in libexecsubdir reflects the scheme 
> that the compilers are using: cc1, cc1plus etc.  Not 
> aarch64-unknown-linux-cc1.

Right.

> 
> > I am sympathetic to the issue that if GCC accepts everything Clang does 
> > and vice-versa, we'll Postel's-law ourselves ourselves over time into 
> > madness as mistakes are accumulated rather than weeded out.
> 
> Right.  I suppose it wouldn't hurt to also look for "${targettriple}-as" 
> in $PATH before looking for 'as' (in $PATH).  But I don't think we can (or 
> should) switch off looking for 'as' in libexecsubdir.  I don't even see 
> why that behaviour should depend on an option, it could just be added by 
> default.

OK, I agree with that. So if someone passes -B$x, how about looking for

- $x/$machine/$version/$prog
- $x/$machine/$prog
- $x/$machine-$prog
- $x/$prog

so no prefixing in the subdir, only in the main dir (sketched below)?

($libexecsubdir is morally $libexec being a search dir + subdir IIRC)
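
In C++-ish pseudocode, the proposed order would be (a sketch of the list
above, not the driver's actual code; machine/version/prog stand for the
target triple, the GCC version, and the tool name):

  #include <string>
  #include <vector>

  // Candidate paths tried, in order, for tool PROG under a -B dir X.
  std::vector<std::string>
  candidate_paths (const std::string &x, const std::string &machine,
                   const std::string &version, const std::string &prog)
  {
    return {
      x + "/" + machine + "/" + version + "/" + prog,
      x + "/" + machine + "/" + prog,
      x + "/" + machine + "-" + prog,   // prefixed only in the main dir
      x + "/" + prog,
    };
  }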

> > I now have some patches for this change I suppose I could also submit.
> 
> Even better :)

Great!

I will continue improving my patch based on the above. In the meantime, I 
posted https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576725.html which 
is a small cleanup that, while helping with my changes, doesn't change the 
behavior, and I hope is good in any event.


Re: Why vectorization didn't turn on by -O2

2021-08-05 Thread Hongtao Liu via Gcc
On Thu, Aug 5, 2021 at 5:20 AM Segher Boessenkool
 wrote:
>
> On Wed, Aug 04, 2021 at 11:22:53AM +0100, Richard Sandiford wrote:
> > Segher Boessenkool  writes:
> > > On Wed, Aug 04, 2021 at 10:10:36AM +0100, Richard Sandiford wrote:
> > >> Richard Biener  writes:
> > >> > Alternatively only enable loop vectorization at -O2 (the above checks
> > >> > flag_tree_slp_vectorize as well).  At least the cost model kind
> > >> > does not have any influence on BB vectorization, that is, we get the
> > >> > same pros and cons as we do for -O3.
> > >>
> > >> Yeah, but a lot of the loop vector cost model choice is about controlling
> > >> code size growth and avoiding excessive runtime versioning tests.
> > >
> > > Both of those depend a lot on the target, and target-specific conditions
> > > as well (which CPU model is selected for example).  Can we factor that
> > > in somehow?  Maybe we need some target hook that returns the expected
> > > percentage code growth for vectorising a given loop, for example, and
> > > -O2 vs. -O3 then selects what percentage is acceptable.
> > >
> > >> BB SLP
> > >> should be a win on both code size and performance (barring significant
> > >> target costing issues).
> > >
> > > Yeah -- but this could use a similar hook as well (just a straightline
> > > piece of code instead of a loop).
> >
> > I think anything like that should be driven by motivating use cases.
> > It's not something that we can easily decide in the abstract.
> >
> > The results so far with using very-cheap at -O2 have been promising,
> > so I don't think new hooks should block that becoming the default.
>
> Right, but it wouldn't hurt to think a sec if we are on the right path
> forward.  It is crystal clear that to make good decisions about what
> and how to vectorise you need to take *some* target characteristics into
> account, and that will have to happen sooner rather than later.
>
> This was all in reply to
>
> > >> Yeah, but a lot of the loop vector cost model choice is about controlling
> > >> code size growth and avoiding excessive runtime versioning tests.
>
> It was not meant to hold up these patches :-)
>
> > >> PR100089 was an exception because we ended up keeping unvectorised
> > >> scalar code that would never have existed otherwise.  BB SLP proper
> > >> shouldn't have that problem.
> > >
> > > It also is a tiny piece of code.  There will always be tiny examples
> > > that are much worse (or much better) than average.
> >
> > Yeah, what makes PR100089 important isn't IMO the test itself, but the
> > underlying problem that the PR exposed.  Enabling this “BB SLP in loop
> > vectorisation” code can lead to the generation of scalar COND_EXPRs even
> > though we know that ifcvt doesn't have a proper cost model for deciding
> > whether scalar COND_EXPRs are a win.
> >
> > Introducing scalar COND_EXPRs at -O3 is arguably an acceptable risk
> > (although still dubious), but I think it's something we need to avoid
> > for -O2, even if that means losing the optimisation.
>
> Yeah -- -O2 should almost always do the right thing, while -O3 can do
> bad things more often, it just has to be better "on average".
>
>
> Segher
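
As a concrete reference point, the kind of loop the very-cheap cost model
is aimed at: fixed trip count and no aliasing, so no runtime versioning
tests and essentially no code growth (whether it is actually vectorised
at -O2 depends on the target and GCC version):

  void add (int *__restrict a, const int *__restrict b)
  {
    for (int i = 0; i < 1024; ++i)
      a[i] += b[i];
  }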

Moving this thread to gcc-patches and gcc.

-- 
BR,
Hongtao