Re: Question about finding parameters in function bodies from SSA variables

2021-08-06 Thread Richard Biener via Gcc
On Thu, Aug 5, 2021 at 3:45 PM Erick Ochoa  wrote:
>
> Hello Richard,
>
> I'm still working on the points-to analysis and I am happy to say that
> after reviewing the ipa-cp code I was able to generate summaries for
> local variables, ssa variables, heap variables, global variables and
> functions. I am also using the callback hooks to find out if
> cgraph_nodes and varpool_nodes are added or deleted between
> read_summaries and execute. Even though I don't update the solutions
> between execute and function_transform yet, I am reading the points-to
> pairs and remapping the constraint variables back to trees during
> function_transform and printing the name of pointer-pointee pairs.
>
> This is still very much a work in progress and a very weak points-to
> analysis. I have almost finished my Andersen's / field-insensitive /
> context-insensitive / flow-insensitive / intraprocedural analysis with
> the LTO framework (without interacting with other transformations
> yet). The only thing that I am missing is making parameters point to
> NONLOCAL memory upon entry to the function, and perhaps handling
> some corner cases where the GIMPLE is not exactly how I expect it to be.
>
> I am wondering about this: none of the variables in
> function->gimple_df->ssa_names and function->local_decls are
> PARM_DECLs. I'm also not entirely sure if I should be looking for
> PARM_DECLs, since looking at function bodies' GIMPLE representation I
> don't see the formal parameters being used inside the function.
> Instead, it appears that some SSA variables are automatically
> initialized with the parameter value. Is this the case?
>
> For example, for a function:
>
> foo (struct a* $NAME)
>
> The variable $NAME is not used anywhere inside the function. I also found
> that there is an SSA variable at location X (in
> function->gimple_df->ssa_names[X]) named with a variation like
> $NAME_$X(D), and this seems to correspond to the parameter $NAME. How
> can one (preferably looking only at
> function->gimple_df->ssa_names[$X]) find out that this tree
> corresponds to a parameter?

Parameters that are written into SSA form have their default definition
represent the incoming value.  Such parameter SSA names are identified
by SSA_NAME_IS_DEFAULT_DEF (name) &&
SSA_NAME_VAR (name) && TREE_CODE (SSA_NAME_VAR (name)) == PARM_DECL
and the corresponding PARM_DECL is the SSA_NAME_VAR.  Parameters
not written into SSA will appear as PARM_DECL in the IL.  Note that similar
things happen to DECL_BY_REFERENCE RESULT_DECLs: the pointer
will be in SSA form, and its default def is the address of the return slot, so
you might see *result_3(D) = x; in the IL.

See get_constraint_for_ssa_var for example.
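
Concretely, a minimal sketch of that check (the helper name is made up; the
macros are the ones named above, as used throughout the GCC sources):

  /* Return the PARM_DECL that a parameter SSA name corresponds to, or
     NULL_TREE if NAME is not the default definition of a parameter.  */
  static tree
  param_decl_for_ssa_name (tree name)
  {
    if (TREE_CODE (name) == SSA_NAME
        && SSA_NAME_IS_DEFAULT_DEF (name)
        && SSA_NAME_VAR (name) != NULL_TREE
        && TREE_CODE (SSA_NAME_VAR (name)) == PARM_DECL)
      return SSA_NAME_VAR (name);   /* the corresponding PARM_DECL */
    return NULL_TREE;
  }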

>
> Many thanks!
> -Erick


Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-06 Thread Richard Earnshaw via Gcc




On 06/08/2021 01:06, Martin Sebor via Gcc wrote:

On 8/4/21 3:46 AM, Richard Earnshaw wrote:



On 03/08/2021 18:44, Martin Sebor wrote:

On 8/3/21 4:11 AM, Prathamesh Kulkarni via Gcc wrote:
On Tue, 27 Jul 2021 at 13:49, Richard Biener 
 wrote:


On Mon, Jul 26, 2021 at 11:06 AM Prathamesh Kulkarni via Gcc
 wrote:


On Fri, 23 Jul 2021 at 23:29, Andrew Pinski  
wrote:


On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc
 wrote:


Hi,
Continuing from this thread,
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
The proposal is to provide a mechanism to mark a parameter in a
function as a literal constant.

Motivation:
Consider the following intrinsic vshl_n_s32 from arm/arm_neon.h:

__extension__ extern __inline int32x2_t
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vshl_n_s32 (int32x2_t __a, const int __b)
{
   return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
}

and its caller:

int32x2_t f (int32x2_t x)
{
    return vshl_n_s32 (x, 1);
}


Can't you do similar to what is done already in the aarch64 
back-end:

#define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0]))
#define __AARCH64_LANE_CHECK(__vec, __idx) \
  __builtin_aarch64_im_lane_boundsi (sizeof(__vec), sizeof(__vec[0]), __idx)

?
Yes this is about lanes but you could even add one for min/max which
is generic and such; add an argument to say the intrinsics name 
even.

You could do this as a non-target builtin if you want and reuse it
also for the aarch64 backend.

Hi Andrew,
Thanks for the suggestions. IIUC, we could use this approach to check
if the argument
falls within a certain range (min / max), but I am not sure how it
will help to determine
if the arg is a constant immediate ? AFAIK, vshl_n intrinsics require
that the 2nd arg is immediate ?

Even the current RTL builtin checking is not consistent across
optimization levels:
For eg:
int32x2_t f(int32_t *restrict a)
{
   int32x2_t v = vld1_s32 (a);
   int b = 2;
   return vshl_n_s32 (v, b);
}

With pristine trunk, compiling with -O2 results in no errors because
constant propagation replaces 'b' with 2, and during expansion,
expand_builtin_args is happy. But at -O0, it results in the error -
"argument 2 must be a constant immediate".

So I guess we need some mechanism to mark a parameter as a constant ?


I guess you want to mark it in a way that the frontend should force
constant evaluation and error if that's not possible?   C++ doesn't
allow to declare a parameter as 'constexpr' but something like

void foo (consteval int i);

since I guess you do want to allow passing constexpr arguments
in C++ or in C extended forms of constants like

static const int a[4];

foo (a[1]);

?  But yes, this looks useful to me.

Hi Richard,
Thanks for the suggestions, and sorry for the late response.
I have attached a prototype patch that implements the consteval attribute.
As implemented, the attribute takes one or more arguments, which refer to
parameter positions, and each corresponding parameter must be
const-qualified, failing which the attribute is ignored.


I'm curious why the argument must be const-qualified.  If it's
to keep it from being changed in ways that would prevent it from
being evaluated at compile-time in the body of the function, then
to be effective, the enforcement of the constraint should be on
the definition of the function.  Otherwise, the const qualifier
could be used in a declaration of a function but left out from
a subsequent definition of it, letting the definition modify
the parameter, like so:

   __attribute__ ((consteval (1))) void f (const int);

   inline __attribute__ ((always_inline)) void f (int i) { ++i; }


In this particular case it's because the inline function is 
implementing an intrinsic operation in the architecture and the 
instruction only supports a literal constant value.  At present we 
catch this while trying to expand the intrinsic, but that can lead to 
poor diagnostics because we really want to report against the line of 
code calling the intrinsic.


Presumably the intrinsics can accept (or can be made to accept) any
constant integer expressions, not just literals.  E.g., the aarch64
builtin below accepts them.  For example, this is accepted in C++:

   __Int64x2_t f (__Int32x2_t a)
   {
     constexpr int n = 2;
     return __builtin_aarch64_vshll_nv2si (a, n + 1);
   }

Making the intrinsics accept constant arguments in constexpr-like
functions and introducing a constexpr-lite attribute (for C code)
was what I was suggesting by the constexpr comment below.  I'd find
that a much more general and more powerful design.



Yes, it would be unfortunate if the rule made it impossible to avoid 
idiomatic const-exprs like those you would write in C++, or even those 
you'd write naturally in C:


#define foo (1u << 5)



But my comment above was to highlight that if requiring the function
argument referenced by the proposed consteval attribute to be const
is necessary to prevent it from being modified 

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Stefan Kanthak
Gabriel Paubert  wrote:

> Hi,
> 
> On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert  wrote:
>> 
>> 
>> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:

>> >>   .intel_syntax
>> >>   .text
>> >>0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = trunc(argument)
>> >>5:   48 f7 d8  neg rax
>> >> # jz  .L0  # argument zero?
>> >>8:   70 16 jo  .L0  # argument indefinite?
>> >># argument overflows 
>> >> 64-bit integer?
>> >>a:   48 f7 d8  neg rax
>> >>d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = 
>> >> trunc(argument)
>> >>   12:   66 0f 73 d0 3fpsrlq   xmm0, 63
>> >>   17:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & 
>> >> -0.0) ? -0.0 : 0.0
>> >>   1c:   66 0f 56 c1   orpdxmm0, xmm1   # xmm0 = 
>> >> trunc(argument)
>> >>   20:   c3  .L0:  ret
>> >>   .end
>> > 
>> > There is one important difference, namely setting the invalid exception
>> > flag when the parameter can't be represented in a signed integer.
>> 
>> Right, I overlooked this fault. Thanks for pointing out.
>> 
>> > So using your code may require some option (-fast-math comes to mind),
>> > or you need at least a check on the exponent before cvttsd2si.
>> 
>> The whole idea behind these implementations is to get rid of loading
>> floating-point constants to perform comparisions.
> 
> Indeed, but what I had in mind was something along the following lines:
> 
> movq rax,xmm0   # and copy rax to say rcx, if needed later
> shrq rax,52 # move sign and exponent to 12 LSBs 
> andl eax,0x7ff  # mask the sign
> cmpl eax,0x434  # value to be checked
> ja return   # exponent too large, we're done (what about NaNs?)
> cvttsd2si rax,xmm0 # safe after exponent check
> cvtsi2sd xmm0,rax  # conversion done
> 
> and a bit more to handle the corner cases (essentially preserve the
> sign to be correct between -1 and -0.0).

The sign of -0.0 is the only corner case and is already handled in my code.
Both SNaN and QNaN (which have the exponent 0x7ff) are handled and
preserved, both in the code GCC generates and in mine.

> But the CPU can (speculatively) start the conversions early, so the
> dependency chain is rather short.

Correct.
 
> I don't know if it's faster than your new code,

It should be faster.

> I'm almost sure that it's shorter.

"neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
5+4+5+5+2=21 bytes.

JFTR: better use "add rax,rax; shr rax,53" instead of
  "shr rax,52; and eax,0x7ff" and save 2 bytes.

Complete properly optimized code for __builtin_trunc is then as follows
(11 instructions, 44 bytes):

.code64
.intel_syntax
.equ    BIAS, 1023
.text
        movq    rax, xmm0    # rax = argument
        add     rax, rax
        shr     rax, 53      # rax = exponent of |argument|
        cmp     eax, BIAS + 53
        jae     .Lexit       # argument indefinite?
                             # |argument| >= 0x1.0p53?
        cvttsd2si rax, xmm0  # rax = trunc(argument)
        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
        psrlq   xmm0, 63
        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
        orpd    xmm0, xmm1   # xmm0 = trunc(argument)
.L0:    ret
.end
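
For reference, a rough C sketch of what the sequence above computes (the
function name is made up; this only illustrates the logic and is not
proposed compiler output):

  #include <stdint.h>
  #include <string.h>

  double
  trunc_sketch (double x)
  {
    uint64_t bits;
    memcpy (&bits, &x, sizeof bits);               /* movq rax, xmm0 */
    uint64_t biased_exp = (bits + bits) >> 53;     /* add rax, rax; shr rax, 53 */
    if (biased_exp >= 1023 + 53)                   /* NaN, Inf or |x| >= 0x1.0p53: */
      return x;                                    /* already integral, return as is */
    double t = (double) (int64_t) x;               /* cvttsd2si; cvtsi2sd */
    uint64_t sign = bits & 0x8000000000000000ull;  /* psrlq/psllq xmm0, 63 */
    uint64_t tbits;
    memcpy (&tbits, &t, sizeof tbits);
    tbits |= sign;                                 /* orpd: keep the sign, so trunc(-0.5) == -0.0 */
    memcpy (&t, &tbits, sizeof tbits);
    return t;
  }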

@Richard Biener (et al.):

1. Is a primitive for "floating-point > 2**x", which generates such
   an "integer" code sequence, already available, at least for
   float/binary32 and double/binary64?

2. The procedural code generator for __builtin_trunc() etc. uses
   __builtin_fabs() and __builtin_copysign() as building blocks.
   These would need to (and of course should) be modified to generate
   psllq/psrlq pairs instead of andpd/andnpd referencing a memory
   location with either -0.0 or ~(-0.0).

For -ffast-math, where the sign of -0.0 is not handled and the spurious
invalid floating-point exception for |argument| >= 2**63 is acceptable,
it boils down to:

.code64
.intel_syntax
.equ    BIAS, 1023
.text
        cvttsd2si rax, xmm0  # rax = trunc(argument)
        jo      .Lexit       # argument indefinite?
                             # |argument| > 0x1.0p63?
        cvtsi2sd xmm0, rax   # xmm0 = trunc(argument)
.L0:    ret
.end

[...]

>> Right, the conversions dominate both the original and the code I posted.
>> It's easy to get rid of them, with still slightly shorter and faster
>> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
>> 57 + 32 = 89 bytes):
>> 
>> .code64
>> .intel_syntax noprefix
>> .text
>>0:   48 b8 00 00 00 00 00 00 30 43   mov rax, 0x4330
>>a:   66 48 0f 6e d0  mo

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Richard Biener via Gcc
On Fri, Aug 6, 2021 at 2:47 PM Stefan Kanthak  wrote:
>
> Gabriel Paubert  wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert  wrote:
> >>
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >>   .intel_syntax
> >> >>   .text
> >> >>0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = 
> >> >> trunc(argument)
> >> >>5:   48 f7 d8  neg rax
> >> >> # jz  .L0  # argument zero?
> >> >>8:   70 16 jo  .L0  # argument 
> >> >> indefinite?
> >> >># argument overflows 
> >> >> 64-bit integer?
> >> >>a:   48 f7 d8  neg rax
> >> >>d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = 
> >> >> trunc(argument)
> >> >>   12:   66 0f 73 d0 3fpsrlq   xmm0, 63
> >> >>   17:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & 
> >> >> -0.0) ? -0.0 : 0.0
> >> >>   1c:   66 0f 56 c1   orpdxmm0, xmm1   # xmm0 = 
> >> >> trunc(argument)
> >> >>   20:   c3  .L0:  ret
> >> >>   .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing out.
> >>
> >> > So using your code may require some option (-fast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisions.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return   # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
>   "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
> .code64
> .intel_syntax
> .equBIAS, 1023
> .text
> movqrax, xmm0# rax = argument
> add rax, rax
> shr rax, 53  # rax = exponent of |argument|
> cmp eax, BIAS + 53
> jae .Lexit   # argument indefinite?
>  # |argument| >= 0x1.0p53?
> cvttsd2si rax, xmm0  # rax = trunc(argument)
> cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> psrlq   xmm0, 63
> psllq   xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> orpdxmm0, xmm1   # xmm0 = trunc(argument)
> .L0:ret
> .end
>
> @Richard Biener (et. al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
>an "integer" code sequence, already available, at least for
>float/binary32 and double/binary64?

Not that I know, but it should be possible to craft that.

> 2. the procedural code generator for __builtin_trunc() etc.  uses
>__builtin_fabs() and __builtin_copysign() as building blocks.
>These would need to (and of course should) be modified to generate
>psllq/psrlq pairs instead of andpd/andnpd referencing a memory
>location with either -0.0 oder ~(-0.0).
>
> For -ffast-math, where the sign of -0.0 is not handled and the spurios
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
> .code64
> .intel_syntax
> .equBIAS, 1023
> .text
> cvttsd2si rax, xmm0  # rax = trunc(argument)
> jo  .Lexit   # argument indefinite?
>  # |argument| > 0x1.0p63?
> cvtsi2sd xmm0, rax   # xmm1 = trunc(argument)
> .L0:ret
> .end
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 instruction

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Gabriel Paubert
On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert  wrote:
> 
> > Hi,
> > 
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert  wrote:
> >> 
> >> 
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> 
> >> >>   .intel_syntax
> >> >>   .text
> >> >>0:   f2 48 0f 2c c0cvttsd2si rax, xmm0  # rax = 
> >> >> trunc(argument)
> >> >>5:   48 f7 d8  neg rax
> >> >> # jz  .L0  # argument zero?
> >> >>8:   70 16 jo  .L0  # argument 
> >> >> indefinite?
> >> >># argument overflows 
> >> >> 64-bit integer?
> >> >>a:   48 f7 d8  neg rax
> >> >>d:   f2 48 0f 2a c8cvtsi2sd xmm1, rax   # xmm1 = 
> >> >> trunc(argument)
> >> >>   12:   66 0f 73 d0 3fpsrlq   xmm0, 63
> >> >>   17:   66 0f 73 f0 3fpsllq   xmm0, 63 # xmm0 = (argument & 
> >> >> -0.0) ? -0.0 : 0.0
> >> >>   1c:   66 0f 56 c1   orpdxmm0, xmm1   # xmm0 = 
> >> >> trunc(argument)
> >> >>   20:   c3  .L0:  ret
> >> >>   .end
> >> > 
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >> 
> >> Right, I overlooked this fault. Thanks for pointing out.
> >> 
> >> > So using your code may require some option (-fast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >> 
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisions.
> > 
> > Indeed, but what I had in mind was something along the following lines:
> > 
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs 
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return   # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> > 
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
> 
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.

I don't know what the standard says about NaNs in this case; I seem to
remember that arithmetic instructions typically produce a QNaN when one of
the inputs is a NaN, whether signaling or not.

> 
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
> 
> Correct.
>  
> > I don't know if it's faster than your new code,
> 
> It should be faster.
> 
> > I'm almost sure that it's shorter.
> 
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
> 
> JFTR: better use "add rax,rax; shr rax,53" instead of
>   "shr rax,52; and eax,0x7ff" and save 2 bytes.

Indeed, I don't have the exact sizes of instructions in my head,
especially since I've not written x86 assembly since the mid-90s.

In any case, with your last improvement, the code is now down to a
single 32-bit immediate constant. And I don't see how to eliminate it...

> 
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
> 
> .code64
> .intel_syntax
> .equBIAS, 1023
> .text
> movqrax, xmm0# rax = argument
> add rax, rax
> shr rax, 53  # rax = exponent of |argument|
> cmp eax, BIAS + 53
> jae .Lexit   # argument indefinite?

Maybe s/.Lexit/.L0/

>  # |argument| >= 0x1.0p53?
> cvttsd2si rax, xmm0  # rax = trunc(argument)
> cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> psrlq   xmm0, 63
> psllq   xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> orpdxmm0, xmm1   # xmm0 = trunc(argument)
> .L0:ret
> .end
> 

This looks nice.

> @Richard Biener (et. al.):
> 
> 1. Is a primitive for "floating-point > 2**x", which generates such
>an "integer" code sequence, already available, at least for
>float/binary32 and double/binary64?
> 
> 2. the procedural code generator for __builtin_trunc() etc.  uses
>__builtin_fabs() and __builtin_copysign() as building blocks.
>These would need to (and of course should) be modified to generate
>psllq/psrlq pairs instead of andpd/andnpd referencing a memory
>location with either -0.0 oder ~(-0.0).
> 
> For -ffast-math, where the sign of -0.0 is not handled and the spurios
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
> 
> .code64

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Michael Matz via Gcc
Hello,

On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> For -ffast-math, where the sign of -0.0 is not handled and the spurios
> invalid floating-point exception for |argument| >= 2**63 is acceptable,

This claim would need to be proven in the wild.  |argument| > 2**52 are 
already integer, and shouldn't generate a spurious exception from the 
various to-int conversions, not even in fast-math mode for some relevant 
set of applications (at least SPECcpu).

Btw, have you made speed measurements with your improvements?  The size 
improvements are obvious, but speed changes can be fairly unintuitive, 
e.g. there were old K8 CPUs where the memory loads for constants are 
actually faster than the equivalent sequence of shifting and masking for 
the >= compares.  That's an irrelevant CPU now, but it shows that 
intuition about speed consequences can be wrong.


Ciao,
Michael.


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Stefan Kanthak
Michael Matz  wrote:


> Hello,
> 
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> 
>> For -ffast-math, where the sign of -0.0 is not handled and the spurios
>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> 
> This claim would need to be proven in the wild.

I should have left the "when" after the "and" which I originally had
written...

> |argument| > 2**52 are already integer, and shouldn't generate a spurious
> exception from the various to-int conversions, not even in fast-math mode
> for some relevant set of applications (at least SPECcpu).
> 
> Btw, have you made speed measurements with your improvements?

No.

> The size improvements are obvious, but speed changes can be fairly
> unintuitive, e.g. there were old K8 CPUs where the memory loads for
> constants are actually faster than the equivalent sequence of shifting
> and masking for the >= compares.  That's an irrelevant CPU now, but it
> shows that intuition about speed consequences can be wrong.

I know. I also know of CPUs that couldn't load a 16-byte wide XMM register
in one go, but had to split the load into two 8-byte loads.

If the constant happens to be present in L1 cache, it MAY load as fast
as an immediate.
BUT: on current CPUs, the code GCC generates

movsd   .LC1(%rip), %xmm2
movsd   .LC0(%rip), %xmm4
movapd  %xmm0, %xmm3
movapd  %xmm0, %xmm1
andpd   %xmm2, %xmm3
ucomisd %xmm3, %xmm4
jbe     38 <_trunc+0x38>

needs
- 4 cycles if the movsd are executed in parallel and the movapd are
  handled by the register renamer,
- 5 cycles if the movsd and the movapd are executed in parallel,
- 7 cycles else,
plus an unknown number of cycles if the constants are not in L1.
The proposed

movq   rax, xmm0
add    rax, rax
shr    rax, 53
cmp    eax, 53+1023
jae    return

needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
register renamer).

Stefan


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Stefan Kanthak
Gabriel Paubert  wrote:


> On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert  wrote:
>> 
>> > Hi,
>> > 
>> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:

[...]

>> >> The whole idea behind these implementations is to get rid of loading
>> >> floating-point constants to perform comparisions.
>> > 
>> > Indeed, but what I had in mind was something along the following lines:
>> > 
>> > movq rax,xmm0   # and copy rax to say rcx, if needed later
>> > shrq rax,52 # move sign and exponent to 12 LSBs 
>> > andl eax,0x7ff  # mask the sign
>> > cmpl eax,0x434  # value to be checked
>> > ja return   # exponent too large, we're done (what about NaNs?)
>> > cvttsd2si rax,xmm0 # safe after exponent check
>> > cvtsi2sd xmm0,rax  # conversion done
>> > 
>> > and a bit more to handle the corner cases (essentially preserve the
>> > sign to be correct between -1 and -0.0).
>> 
>> The sign of -0.0 is the only corner case and already handled in my code.
>> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
>> preserved, as in the code GCC generates as well as my code.
> 
> I don't know what the standard says about NaNs in this case, I seem to
> remember that arithmetic instructions typically produce QNaN when one of
> the inputs is a NaN, whether signaling or not. 


The POSIX specification of trunc() and its cousins, as well as the C standard, say

| If x is NaN, a NaN shall be returned.

That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.

>> > But the CPU can (speculatively) start the conversions early, so the
>> > dependency chain is rather short.
>> 
>> Correct.
>>  
>> > I don't know if it's faster than your new code,
>> 
>> It should be faster.
>> 
>> > I'm almost sure that it's shorter.
>> 
>> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
>> 5+4+5+5+2=21 bytes.
>> 
>> JFTR: better use "add rax,rax; shr rax,53" instead of
>>   "shr rax,52; and eax,0x7ff" and save 2 bytes.
> 
> Indeed, I don't have the exact size of instructions in my head,
> especially since I've not written x86 assembly since the mid 90s.
> 
> In any case, with your last improvement, the code is now down to a
> single 32 bit immediate constant. And I don't see how to eliminate it...
> 
>> 
>> Complete properly optimized code for __builtin_trunc is then as follows
>> (11 instructions, 44 bytes):
>> 
>> .code64
>> .intel_syntax
>> .equBIAS, 1023
>> .text
>> movqrax, xmm0# rax = argument
>> add rax, rax
>> shr rax, 53  # rax = exponent of |argument|
>> cmp eax, BIAS + 53
>> jae .Lexit   # argument indefinite?
> 
> Maybe s/.Lexit/.L0/

Surely!

>>  # |argument| >= 0x1.0p53?
>> cvttsd2si rax, xmm0  # rax = trunc(argument)
>> cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>> psrlq   xmm0, 63
>> psllq   xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> orpdxmm0, xmm1   # xmm0 = trunc(argument)
>> .L0:ret
>> .end
>> 
> 
> This looks nice.

Let's see how to convince GCC to generate such code sequences...

Stefan


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Michael Matz via Gcc
Hello,

On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> >> For -ffast-math, where the sign of -0.0 is not handled and the 
> >> spurios invalid floating-point exception for |argument| >= 2**63 is 
> >> acceptable,
> > 
> > This claim would need to be proven in the wild.
> 
> I should have left the "when" after the "and" which I originally had 
> written...
> 
> > |argument| > 2**52 are already integer, and shouldn't generate a 
> > spurious exception from the various to-int conversions, not even in 
> > fast-math mode for some relevant set of applications (at least 
> > SPECcpu).
> > 
> > Btw, have you made speed measurements with your improvements?
> 
> No.
> 
> > The size improvements are obvious, but speed changes can be fairly 
> > unintuitive, e.g. there were old K8 CPUs where the memory loads for 
> > constants are actually faster than the equivalent sequence of shifting 
> > and masking for the >= compares.  That's an irrelevant CPU now, but it 
> > shows that intuition about speed consequences can be wrong.
> 
> I know. I also know of CPUs that can't load a 16-byte wide XMM register 
> in one go, but had to split the load into 2 8-byte loads.
> 
> If the constant happens to be present in L1 cache, it MAY load as fast
> as an immediate.
> BUT: on current CPUs, the code GCC generates
> 
> movsd  .LC1(%rip), %xmm2
> movsd  .LC0(%rip), %xmm4
> movapd %xmm0, %xmm3
> movapd %xmm0, %xmm1
> andpd  %xmm2, %xmm3
> ucomisd %xmm3, %xmm4
> jbe38 <_trunc+0x38>
>  
> needs
> - 4 cycles if the movsd are executed in parallel and the movapd are
>   handled by the register renamer,
> - 5 cycles if the movsd and the movapd are executed in parallel,
> - 7 cycles else,
> plus an unknown number of cycles if the constants are not in L1.

You also need to consider the case where the to-int converters are called
in a loop (which ultimately is the only interesting case for
performance), where it's possible to load the constants before the loop
and keep them in registers (at the expense of two registers of register
pressure, of course), effectively removing the loads from cost
considerations.  It's all tough choices, which is why stuff needs to be
measured in some contexts :-)

(I do like your sequences btw, it's just not 100% clearcut that they are 
always a speed improvement).


Ciao,
Michael.

> The proposed
> 
> movq   rax, xmm0
> addrax, rax
> shrrax, 53
> cmpeax, 53+1023
> jaereturn
> 
> needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
> register renamer).
> 
> Stefan
> 


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Richard Biener via Gcc
On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak 
 wrote:
>Michael Matz  wrote:
>
>
>> Hello,
>> 
>> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>> 
>>> For -ffast-math, where the sign of -0.0 is not handled and the spurios
>>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>> 
>> This claim would need to be proven in the wild.
>
>I should have left the "when" after the "and" which I originally had
>written...
>
>> |argument| > 2**52 are already integer, and shouldn't generate a spurious
>> exception from the various to-int conversions, not even in fast-math mode
>> for some relevant set of applications (at least SPECcpu).
>> 
>> Btw, have you made speed measurements with your improvements?
>
>No.
>
>> The size improvements are obvious, but speed changes can be fairly
>> unintuitive, e.g. there were old K8 CPUs where the memory loads for
>> constants are actually faster than the equivalent sequence of shifting
>> and masking for the >= compares.  That's an irrelevant CPU now, but it
>> shows that intuition about speed consequences can be wrong.
>
>I know. I also know of CPUs that can't load a 16-byte wide XMM register
>in one go, but had to split the load into 2 8-byte loads.
>
>If the constant happens to be present in L1 cache, it MAY load as fast
>as an immediate.
>BUT: on current CPUs, the code GCC generates
>
>movsd  .LC1(%rip), %xmm2
>movsd  .LC0(%rip), %xmm4
>movapd %xmm0, %xmm3
>movapd %xmm0, %xmm1
>andpd  %xmm2, %xmm3
>ucomisd %xmm3, %xmm4
>jbe38 <_trunc+0x38>
> 
>needs
>- 4 cycles if the movsd are executed in parallel and the movapd are
>  handled by the register renamer,
>- 5 cycles if the movsd and the movapd are executed in parallel,
>- 7 cycles else,
>plus an unknown number of cycles if the constants are not in L1.
>The proposed
>
>movq   rax, xmm0

The xmm to GPR move costs you an extra cycle in latency. Shifts also tend to be
port constrained. The original sequences are also somewhat straightforward to
vectorize.

>addrax, rax
>shrrax, 53
>cmpeax, 53+1023
>jaereturn
>
>needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
>register renamer).
>
>Stefan



Re: daily report on extending static analyzer project [GSoC]

2021-08-06 Thread Ankur Saini via Gcc



> On 06-Aug-2021, at 4:39 AM, David Malcolm  wrote:
> 
> On Thu, 2021-08-05 at 20:27 +0530, Ankur Saini wrote:
>> 
>> 
>>> On 05-Aug-2021, at 4:56 AM, David Malcolm 
>>> wrote:
>>> 
>>> On Wed, 2021-08-04 at 21:32 +0530, Ankur Saini wrote:
>>> 
>>> [...snip...]
 
 - From observation, a typical vfunc call that isn't devirtualised
 by
 the compiler's front end looks something like this 
 "OBJ_TYPE_REF(_2;(struct A)a_ptr_5(D)->0) (a_ptr_5(D))"
 where "a_ptr_5(D)" is pointer that is being used to call the
 virtual
 function.
 
 - We can access its region to see what is the type of the object
 the
 pointer is actually pointing to.
 
 - This is then used to find a call with DECL_CONTEXT of the object
 from the all the possible targets of that polymorphic call.
>>> 
>>> [...]
>>> 
 
 Patch file ( prototype ) : 
 
>>> 
 +  /* Call is possibly a polymorphic call.
 +  
 + In such a case, use devirtualisation tools to find 
 + possible callees of this function call.  */
 +  
 +  function *fun = get_current_function ();
 +  gcall *stmt  = const_cast (call);
 +  cgraph_edge *e = cgraph_node::get (fun->decl)->get_edge (stmt);
 +  if (e->indirect_info->polymorphic)
 +  {
 +void *cache_token;
 +bool final;
 +vec  targets
 +  = possible_polymorphic_call_targets (e, &final,
 &cache_token, true);
 +if (!targets.is_empty ())
 +  {
 +tree most_propbable_taget = NULL_TREE;
 +if(targets.length () == 1)
 +   return targets[0]->decl;
 +
 +/* From the current state, check which subclass the
 pointer that 
 +   is being used to this polymorphic call points to, and
 use to
 +   filter out correct function call.  */
 +tree t_val = gimple_call_arg (call, 0);
>>> 
>>> Maybe rename to "this_expr"?
>>> 
>>> 
 +const svalue *sval = get_rvalue (t_val, ctxt);
>>> 
>>> and "this_sval"?
>> 
>> ok
>> 
>>> 
>>> ...assuming that that's what the value is.
>>> 
>>> Probably should reject the case where there are zero arguments.
>> 
>> Ideally it should always have one argument representing the pointer
>> used to call the function. 
>> 
>> for example, if the function is called like this : -
>> 
>> a_ptr->foo(arg);  // where foo() is a virtual function and a_ptr is a
>> pointer to an object of a subclass.
>> 
>> I saw that it’s GIMPLE representation is as follows : -
>> 
>> OBJ_TYPE_REF(_2;(struct A)a_ptr_5(D)->0) (a_ptr_5, arg);
>> 
>>> 
>>> 
 +
 +const region *reg
 +  = [&]()->const region *
 +  {
 +switch (sval->get_kind ())
 +  {
 +case SK_INITIAL:
 +  {
 +const initial_svalue *initial_sval
 +  = sval->dyn_cast_initial_svalue ();
 +return initial_sval->get_region ();
 +  }
 +  break;
 +case SK_REGION:
 +  {
 +const region_svalue *region_sval 
 +  = sval->dyn_cast_region_svalue ();
 +return region_sval->get_pointee ();
 +  }
 +  break;
 +
 +default:
 +  return NULL;
 +  }
 +  } ();
>>> 
>>> I think the above should probably be a subroutine.
>>> 
>>> That said, it's not clear to me what it's doing, or that this is
>>> correct.
>> 
>> 
>> Sorry, I think I should have explained it earlier.
>> 
>> Let's take an example code snippet :- 
>> 
>> Derived d;
>> Base *base_ptr;
>> base_ptr = &d;
>> base_ptr->foo();// where foo() is a virtual function
>> 
>> This generates the following GIMPLE dump :- 
>> 
>> Derived::Derived (&d);
>> base_ptr_6 = &d.D.3779;
>> _1 = base_ptr_6->_vptr.Base;
>> _2 = _1 + 8;
>> _3 = *_2;
>> OBJ_TYPE_REF(_3;(struct Base)base_ptr_6->1) (base_ptr_6);
> 
> I did a bit of playing with this example, and tried adding:
> 
> 1876  case OBJ_TYPE_REF:
> 1877gcc_unreachable ();
> 1878break;
> 
> to region_model::get_rvalue_1, and running cc1plus under the debugger.
> 
> The debugger hits the "gcc_unreachable ();", at this stmt:
> 
> OBJ_TYPE_REF(_2;(struct Base)base_ptr_5->0) (base_ptr_5);
> 
> Looking at the region_model with region_model::debug() shows:
> 
> (gdb) call debug()
> stack depth: 1
>  frame (index 0): frame: ‘test’@1
> clusters within frame: ‘test’@1
>  cluster for: Derived d
>key:   {bytes 0-7}
>value: ‘int (*) () *’ {(&constexpr int (* Derived::_ZTV7Derived 
> [3])(...)+(sizetype)16)}
>  cluster for: base_ptr_5: &Derived d.
>  cluster for: _2: &‘foo’
> m_called_unknown_fn: FALSE
> cons

'hash_map<tree, hash_map<tree, tree>>'

2021-08-06 Thread Thomas Schwinge
Hi!

So I'm trying to do some C++...  ;-)

Given:

/* A map from SSA names or var decls to record fields.  */
typedef hash_map<tree, tree> field_map_t;

/* For each propagation record type, this is a map from SSA names or var decls
   to propagate, to the field in the record type that should be used for
   transmission and reception.  */
typedef hash_map<tree, field_map_t> record_field_map_t;

Thus, that's a 'hash_map<tree, hash_map<tree, tree>>'.  (I may do that,
right?)  Looking through GCC implementation files, the vast majority of uses
of 'hash_map' boil down to a pointer key ('tree', for example) and
a pointer/integer value.

Then:

record_field_map_t field_map ([...]); // see below
for ([...])
  {
tree record_type = [...];
[...]
bool existed;
field_map_t &fields
  = field_map.get_or_insert (record_type, &existed);
gcc_checking_assert (!existed);
[...]
for ([...])
  fields.put ([...], [...]);
[...]
  }
[stuff that looks up elements from 'field_map']
field_map.empty ();

This generally works.

If I instantiate 'record_field_map_t field_map (40);', Valgrind is happy.
If however I instantiate 'record_field_map_t field_map (13);' (where '13'
would be the default for 'hash_map'), Valgrind complains:

2,080 bytes in 10 blocks are definitely lost in loss record 828 of 876
   at 0x483DD99: calloc (vg_replace_malloc.c:762)
   by 0x175F010: xcalloc (xmalloc.c:162)
   by 0xAF4A2C: hash_table, tree_node*> 
>::hash_entry, false, xcallocator>::hash_table(unsigned long, bool, bool, bool, 
mem_alloc_origin) (hash-table.h:275)
   by 0x15E0120: hash_map, tree_node*> 
>::hash_map(unsigned long, bool, bool, bool) (hash-map.h:143)
   by 0x15DEE87: hash_map, tree_node*> >, 
simple_hashmap_traits, hash_map, tree_node*> 
> > >::get_or_insert(tree_node* const&, bool*) (hash-map.h:205)
   by 0x15DD52C: execute_omp_oacc_neuter_broadcast() 
(omp-oacc-neuter-broadcast.cc:1371)
   [...]

(That's with '#pragma GCC optimize "O0"' at the top of the 'gcc/*.cc'
file.)

My suspicion was that it is due to the 'field_map' getting resized as it
incrementally grows (and '40' being big enough for that to never happen),
and somehow the non-POD (?) value objects not being properly handled
during that.  Working my way a bit through 'gcc/hash-map.*' and
'gcc/hash-table.*' (but not claiming that I understand all that, off
hand), it seems as if my theory is right: I'm able to plug this memory
leak as follows:

--- gcc/hash-table.h
+++ gcc/hash-table.h
@@ -820,6 +820,8 @@ hash_table::expand ()
 {
   value_type *q = find_empty_slot_for_expand (Descriptor::hash 
(x));
  new ((void*) q) value_type (std::move (x));
+ //BAD Descriptor::remove (x); // (doesn't make sense and) a ton of 
"Invalid read [...] inside a block of size [...] free'd"
+ x.~value_type (); //GOOD This seems to work!  -- but does it make 
sense?
 }

   p++;

However, that doesn't exactly look like a correct fix, does it?  I'd
expect such a manual destructor call in combination with placement new
(that is being used here, obviously) -- but this is after 'std::move'?
However, this also survives a smoke-test-like run of parts of the GCC
testsuite, bootstrap and complete run now ongoing.


Grüße
 Thomas
-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Stefan Kanthak
Richard Biener  wrote:

> On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak 
>  wrote:
>>Michael Matz  wrote:

>>> Btw, have you made speed measurements with your improvements?
>>
>>No.
[...]
>>If the constant happens to be present in L1 cache, it MAY load as fast
>>as an immediate.
>>BUT: on current CPUs, the code GCC generates
>>
>>movsd  .LC1(%rip), %xmm2
>>movsd  .LC0(%rip), %xmm4
>>movapd %xmm0, %xmm3
>>movapd %xmm0, %xmm1
>>andpd  %xmm2, %xmm3
>>ucomisd %xmm3, %xmm4
>>jbe38 <_trunc+0x38>
>> 
>>needs
>>- 4 cycles if the movsd are executed in parallel and the movapd are
>>  handled by the register renamer,
>>- 5 cycles if the movsd and the movapd are executed in parallel,
>>- 7 cycles else,
>>plus an unknown number of cycles if the constants are not in L1.
>>The proposed
>>
>>movq   rax, xmm0
>
> The xmm to GPR move costs you an extra cycle in latency. Shifts also
> tend to be port constrained. The original sequences are also somewhat
> straight forward to vectorize. 

Please show how GCC vectorizes CVT[T]SD2SI and CVTSI2SD!
These are the bottlenecks in the current code.

If you want the code for trunc() and cousins to be vectorizable you
should stay with the alternative code I presented some posts before,
which GCC should be (able to) generate from its other procedural
variant.

Stefan


Re: gcc_assert() and inhibit_libc

2021-08-06 Thread Sebastian Huber

On 22/07/2021 14:15, Richard Biener wrote:

On Wed, Jul 21, 2021 at 2:45 PM Sebastian Huber
  wrote:

Hello,

while testing this patch

https://www.google.com/search?client=firefox-b-e&q=gcc+enable_runtime_checking

I noticed that __gcov_info_to_gcda() uses abort(). This is due to (from
tsystem.h):

#ifdef ENABLE_RUNTIME_CHECKING
#define gcc_assert(EXPR) ((void)(!(EXPR) ? abort (), 0 : 0))
#else
/* Include EXPR, so that unused variable warnings do not occur.  */
#define gcc_assert(EXPR) ((void)(0 && (EXPR)))
#endif

In tsystem.h there is this if inhibit_libc is defined:

#ifndef abort
extern void abort (void) __attribute__ ((__noreturn__));
#endif

Who is supposed to define abort here optionally? Can this be defined for
example by a target configuration header like gcc/config/rtems.h?

I suppose for inhibit_libc we could use __builtin_trap () (but that might
expand to abort() on some targets)


In the case of RTEMS, the application and the operating system are statically
linked into one executable. The more features an application uses, the
bigger the executable becomes. The abort() function pulls in a lot of
code since it uses signals and may attempt to close all open streams.
It would be nice if gcc_assert() could be customized by the target
configuration. For RTEMS we could use the Newlib-defined __assert_func()
from <assert.h>:


# define assert(__e) ((__e) ? (void)0 : __assert_func (__FILE__, __LINE__, \
   __ASSERT_FUNC, #__e))


void __assert_func (const char *, int, const char *, const char *)
_ATTRIBUTE ((__noreturn__));
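
As an illustration, one possible shape for such a customization (a sketch
only, not existing GCC code; the RTEMS override and its placement in a
target header are assumptions):

  /* Hypothetically, in a target header such as gcc/config/rtems.h:  */
  #define gcc_assert(EXPR) \
    ((EXPR) ? (void) 0 \
            : __assert_func (__FILE__, __LINE__, __func__, #EXPR))

  /* In tsystem.h, the existing definitions would then become the fallback:  */
  #ifndef gcc_assert
  # ifdef ENABLE_RUNTIME_CHECKING
  #  define gcc_assert(EXPR) ((void)(!(EXPR) ? abort (), 0 : 0))
  # else
  /* Include EXPR, so that unused variable warnings do not occur.  */
  #  define gcc_assert(EXPR) ((void)(0 && (EXPR)))
  # endif
  #endif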


--
embedded brains GmbH
Herr Sebastian HUBER
Dornierstr. 4
82178 Puchheim
Germany
email: sebastian.hu...@embedded-brains.de
phone: +49-89-18 94 741 - 16
fax:   +49-89-18 94 741 - 08

Registergericht: Amtsgericht München
Registernummer: HRB 157899
Vertretungsberechtigte Geschäftsführer: Peter Rasmussen, Thomas Dörfler
Unsere Datenschutzerklärung finden Sie hier:
https://embedded-brains.de/datenschutzerklaerung/


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Joseph Myers
On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> > I don't know what the standard says about NaNs in this case, I seem to
> > remember that arithmetic instructions typically produce QNaN when one of
> > the inputs is a NaN, whether signaling or not. 
> 
> 
> and its cousins as well as the C standard say
> 
> | If x is NaN, a NaN shall be returned.
> 
> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.

You should be looking at TS 18661-3 / C2x Annex F for sNaN handling; the 
POSIX attempts to deal with signaling NaNs aren't well thought out.  (And 
possibly adding flag_signaling_nans conditions as appropriate to disable 
for -fsignaling-nans anything for these to-integer operations that doesn't 
produce a qNaN with INVALID raised from sNaN input.)  Though in C2x mode, 
these SSE2 code sequences won't be used by default anyway, except for rint 
(C2x implies -fno-fp-int-builtin-inexact).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: 'hash_map<tree, hash_map<tree, tree>>'

2021-08-06 Thread Jonathan Wakely via Gcc
On Fri, 6 Aug 2021, 17:58 Thomas Schwinge,  wrote:

> Hi!
>
> So I'm trying to do some C++...  ;-)
>
> Given:
>
> /* A map from SSA names or var decls to record fields.  */
> typedef hash_map<tree, tree> field_map_t;
>
> /* For each propagation record type, this is a map from SSA names or var decls
>    to propagate, to the field in the record type that should be used for
>    transmission and reception.  */
> typedef hash_map<tree, field_map_t> record_field_map_t;
>
> Thus, that's a 'hash_map<tree, hash_map<tree, tree>>'.  (I may do that,
> right?)  Looking through GCC implementation files, the vast majority of uses
> of 'hash_map' boil down to a pointer key ('tree', for example) and
> a pointer/integer value.
>
> Then:
>
> record_field_map_t field_map ([...]); // see below
> for ([...])
>   {
> tree record_type = [...];
> [...]
> bool existed;
> field_map_t &fields
>   = field_map.get_or_insert (record_type, &existed);
> gcc_checking_assert (!existed);
> [...]
> for ([...])
>   fields.put ([...], [...]);
> [...]
>   }
> [stuff that looks up elements from 'field_map']
> field_map.empty ();
>
> This generally works.
>
> If I instantiate 'record_field_map_t field_map (40);', Valgrind is happy.
> If however I instantiate 'record_field_map_t field_map (13);' (where '13'
> would be the default for 'hash_map'), Valgrind complains:
>
> 2,080 bytes in 10 blocks are definitely lost in loss record 828 of 876
>at 0x483DD99: calloc (vg_replace_malloc.c:762)
>by 0x175F010: xcalloc (xmalloc.c:162)
>by 0xAF4A2C: hash_table simple_hashmap_traits, tree_node*>
> >::hash_entry, false, xcallocator>::hash_table(unsigned long, bool, bool,
> bool, mem_alloc_origin) (hash-table.h:275)
>by 0x15E0120: hash_map simple_hashmap_traits, tree_node*>
> >::hash_map(unsigned long, bool, bool, bool) (hash-map.h:143)
>by 0x15DEE87: hash_map simple_hashmap_traits, tree_node*> >,
> simple_hashmap_traits, hash_map tree_node*, simple_hashmap_traits,
> tree_node*> > > >::get_or_insert(tree_node* const&, bool*) (hash-map.h:205)
>by 0x15DD52C: execute_omp_oacc_neuter_broadcast()
> (omp-oacc-neuter-broadcast.cc:1371)
>[...]
>
> (That's with '#pragma GCC optimize "O0"' at the top of the 'gcc/*.cc'
> file.)
>
> My suspicion was that it is due to the 'field_map' getting resized as it
> incrementally grows (and '40' being big enough for that to never happen),
> and somehow the non-POD (?) value objects not being properly handled
> during that.  Working my way a bit through 'gcc/hash-map.*' and
> 'gcc/hash-table.*' (but not claiming that I understand all that, off
> hand), it seems as if my theory is right: I'm able to plug this memory
> leak as follows:
>
> --- gcc/hash-table.h
> +++ gcc/hash-table.h
> @@ -820,6 +820,8 @@ hash_table::expand ()
>  {
>value_type *q = find_empty_slot_for_expand
> (Descriptor::hash (x));
>   new ((void*) q) value_type (std::move (x));
> + //BAD Descriptor::remove (x); // (doesn't make sense and) a ton
> of "Invalid read [...] inside a block of size [...] free'd"
> + x.~value_type (); //GOOD This seems to work!  -- but does it
> make sense?
>  }
>
>p++;
>
> However, that doesn't exactly look like a correct fix, does it?  I'd
> expect such a manual destructor call in combination with placement new
> (that is being used here, obviously) -- but this is after 'std::move'?
> However, this also survives a smoke-test-like run of parts of the GCC
> testsuite, bootstrap and complete run now ongoing.
>

Does GCC's hash_map assume you only use it to store POD (plain old data)
types, which don't need to be destroyed, because they don't have any
dynamically allocated memory or other resources?

A hash_map is not a POD, because it does have dynamically allocated memory.

If my guess is right, then hash_map should really use a static_assert to
enforce that requirement, instead of letting you use it in a way that will
leak.
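
A self-contained sketch of what such a guard could look like (the helper
name is hypothetical; GCC's hash_map does not currently contain this check,
and the exact trait to test is a design decision):

  #include <type_traits>

  /* Value types stored in a container that never runs destructors should
     be trivially destructible; otherwise any resources they own leak.  */
  template <typename Value>
  struct trivially_destructible_value_check
  {
    static_assert (std::is_trivially_destructible<Value>::value,
                   "value type owns resources the container would leak");
  };

  int
  main ()
  {
    trivially_destructible_value_check<long> ok;                  // compiles
    // trivially_destructible_value_check<std::vector<int>> bad;  // would fail to compile
    (void) ok;
    return 0;
  }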


Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-06 Thread Martin Uecker via Gcc
> On Wed, 4 Aug 2021 at 03:27, Segher Boessenkool
>  wrote:
> >
> > Hi!
> >
> > On Fri, Jul 23, 2021 at 04:23:42PM +0530, Prathamesh Kulkarni via Gcc wrote:
> > > The constraint here is that, vshl_n intrinsics require that the
> > > second arg (__b),
> > > should be an immediate value.
> >
> > Something that matches the "n" constraint, not necessarily a literal,
> > but stricter than just "immediate".  It probably is a good idea to allow
> > only "integer constant expression"s, so that the validity of the source
> > code does not depend on what the optimisers do with the code.
> >
> > > As Richard suggested, sth like:
> > > void foo(int x __attribute__((literal_constant (min_val, max_val)));
> >
> > The Linux kernel has a macro __is_constexpr to test if something is an
> > integer constant expression, see  .  That is a much
> > better idea imo.  There could be a builtin for that of course, but an
> > attribute is less powerful, less usable, less useful.
> Hi Segher,
> Thanks for the suggestions. I am not sure tho if we could use a macro
> similar to __is_constexpr
> to check if parameter is constant inside an inline function (which is
> the case for intrinsics) ?
> 
> For eg:
> #define __is_constexpr(x) \
> (sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))
> 
> inline int foo(const int x)
> {
>   _Static_assert (__is_constexpr (x));
>   return x;
> }
> 
> int main()
> {
>   return foo (1);
> }
> 
> results in:
> foo.c: In function ‘foo’:
> foo.c:8:3: error: static assertion failed
> 8 |   _Static_assert (__is_constexpr (x));
> 
> Initially we tried to use __Static_assert (__builtin_constant_p (arg))
> for the same purpose but that did not work
> because while parsing the intrinsic function, the FE cannot determine
> if the arg is indeed a constant.
> I guess the static assertion or __is_constexpr would work only if the
> intrinsic were defined as a macro instead of an inline function ?
> Or am I misunderstanding ?

You have to use it at the call site:

#define __is_constexpr(x) \
 (sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))

#define foo(x) foo(({ _Static_assert(__is_constexpr(x), "no ICE"); (x); }))

inline int (foo)(const int x)
{
  return x;
}

int main()
{
  foo(1);
  int n = 1;
  foo(n);   // error
}


--Martin




Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-06 Thread Martin Sebor via Gcc

On 8/6/21 4:51 AM, Richard Earnshaw wrote:



On 06/08/2021 01:06, Martin Sebor via Gcc wrote:

On 8/4/21 3:46 AM, Richard Earnshaw wrote:



On 03/08/2021 18:44, Martin Sebor wrote:

On 8/3/21 4:11 AM, Prathamesh Kulkarni via Gcc wrote:
On Tue, 27 Jul 2021 at 13:49, Richard Biener 
 wrote:


On Mon, Jul 26, 2021 at 11:06 AM Prathamesh Kulkarni via Gcc
 wrote:


On Fri, 23 Jul 2021 at 23:29, Andrew Pinski  
wrote:


On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc
 wrote:


Hi,
Continuing from this thread,
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
The proposal is to provide a mechanism to mark a parameter in a
function as a literal constant.

Motivation:
Consider the following intrinsic vshl_n_s32 from arm/arm_neon.h:

__extension__ extern __inline int32x2_t
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vshl_n_s32 (int32x2_t __a, const int __b)
{
   return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
}

and its caller:

int32x2_t f (int32x2_t x)
{
    return vshl_n_s32 (x, 1);
}


Can't you do similar to what is done already in the aarch64 
back-end:

#define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0]))
#define __AARCH64_LANE_CHECK(__vec, __idx) \
  __builtin_aarch64_im_lane_boundsi (sizeof(__vec), sizeof(__vec[0]), __idx)

?
Yes this is about lanes but you could even add one for min/max 
which
is generic and such; add an argument to say the intrinsics name 
even.

You could do this as a non-target builtin if you want and reuse it
also for the aarch64 backend.

Hi Andrew,
Thanks for the suggestions. IIUC, we could use this approach to 
check

if the argument
falls within a certain range (min / max), but I am not sure how it
will help to determine
if the arg is a constant immediate ? AFAIK, vshl_n intrinsics 
require

that the 2nd arg is immediate ?

Even the current RTL builtin checking is not consistent across
optimization levels:
For eg:
int32x2_t f(int32_t *restrict a)
{
   int32x2_t v = vld1_s32 (a);
   int b = 2;
   return vshl_n_s32 (v, b);
}

With pristine trunk, compiling with -O2 results in no errors because
constant propagation replaces 'b' with 2, and during expansion,
expand_builtin_args is happy. But at -O0, it results in the error -
"argument 2 must be a constant immediate".

So I guess we need some mechanism to mark a parameter as a 
constant ?


I guess you want to mark it in a way that the frontend should force
constant evaluation and error if that's not possible?   C++ doesn't
allow to declare a parameter as 'constexpr' but something like

void foo (consteval int i);

since I guess you do want to allow passing constexpr arguments
in C++ or in C extended forms of constants like

static const int a[4];

foo (a[1]);

?  But yes, this looks useful to me.

Hi Richard,
Thanks for the suggestions, and sorry for the late response.
I have attached a prototype patch that implements the consteval attribute.
As implemented, the attribute takes one or more arguments, which refer to
parameter positions, and each corresponding parameter must be
const-qualified, failing which the attribute is ignored.


I'm curious why the argument must be const-qualified.  If it's
to keep it from being changed in ways that would prevent it from
being evaluated at compile-time in the body of the function, then
to be effective, the enforcement of the constraint should be on
the definition of the function.  Otherwise, the const qualifier
could be used in a declaration of a function but left out from
a subsequent definition of it, letting the definition modify
the parameter, like so:

   __attribute__ ((consteval (1))) void f (const int);

   inline __attribute__ ((always_inline)) void f (int i) { ++i; }


In this particular case it's because the inline function is 
implementing an intrinsic operation in the architecture and the 
instruction only supports a literal constant value.  At present we 
catch this while trying to expand the intrinsic, but that can lead to 
poor diagnostics because we really want to report against the line of 
code calling the intrinsic.


Presumably the intrinsics can accept (or can be made to accept) any
constant integer expressions, not just literals.  E.g., the aarch64
builtin below accepts them.  For example, this is accepted in C++:

   __Int64x2_t f (__Int32x2_t a)
   {
 constexpr int n = 2;
 return __builtin_aarch64_vshll_nv2si (a, n + 1);
   }

Making the intrinsics accept constant arguments in constexpr-like
functions and introducing a constexpr-lite attribute (for C code)
was what I was suggesting by the constexpr comment below.  I'd find
that a much more general and more powerful design.



Yes, it would be unfortunate if the rule made it impossible to avoid 
idiomatic const-exprs like those you would write in C++, or even those 
you'd write naturally in C:


#define foo (1u << 5)



But my comment above was to highlight that if requiring the function
argument referenced by the proposed consteval attribute to be cons

gcc-10-20210806 is now available

2021-08-06 Thread GCC Administrator via Gcc
Snapshot gcc-10-20210806 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/10-20210806/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 10 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch 
releases/gcc-10 revision 2d015fbba95d3136eabb132e43b3b3f96a4eeb69

You'll find:

 gcc-10-20210806.tar.xz   Complete GCC

  SHA256=12f21e5792d1f1049caeb310510ec76b035c1f5f9dee1f0b047ce28a172f47e1
  SHA1=723ac50f4a702662c6f234044972054a5f19ae2e

Diffs from 10-20210730 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-10
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.