Recognizing loop pattern

2020-10-26 Thread Stefan Schulze Frielinghaus via Gcc
I'm trying to detect loops of the form

  while (*x != y)
++x;

which mimic the behaviour of function rawmemchr.  Note, the size of *x is not
necessarily one byte.  Thus ultimately I would like to detect such loops and
replace them with calls to builtins rawmemchr8, rawmemchr16, rawmemchr32 if
they are implemented in the backend:

  x = __builtin_rawmemchr16(x, y);

I'm wondering whether there is a particular place in order to detect such loop
patterns.  For example, in the loop distribution pass GCC recognizes loops
which mimic the behavior of memset, memcpy, memmove and replaces them with
calls to their corresponding builtins, respectively.  The pass and in
particular partitioning of statements depends on whether a statement is used
outside of a partition or not.  This works perfectly fine for loops which
implement the mentioned mem* operations since their result is typically
ignored.  However, the result of a rawmemchr function is/should never be
ignored.  Therefore, such loops are currently recognized as having a reduction
which makes an implementation into the loop distribution pass not straight
forward to me.

Are there other places where you would detect such loops?  Any comments?

Cheers,
Stefan


Re: [__mulvti3] register allocator plays shell game

2020-10-26 Thread Richard Biener via Gcc
On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak  wrote:
>
> Hi,
>
> for the AMD64 alias x86_64 platform and the __int128_t [DW]type,
> the first few lines of the __mulvDI3() function from libgcc2.c
>
> | DWtype
> | __mulvDI3 (DWtype u, DWtype v)
> | {
> |   /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
> |  but the checked multiplication needs only two.  */
> |   const DWunion uu = {.ll = u};
> |   const DWunion vv = {.ll = v};
> |
> |   if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
> | {
> |   /* u fits in a single Wtype.  */
> |   if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
> |  {
> |/* v fits in a single Wtype as well.  */
> |/* A single multiplication.  No overflow risk.  */
> |return (DWtype) uu.s.low * (DWtype) vv.s.low;
> |  }
>
> are compiled to this braindead code (obtained from libgcc.a of
> GCC 10.2.0 installed on Debian):
>
>  <__mulvti3>:
>0: 41 55 push   %r13
>2: 49 89 cb  mov%rcx,%r11
>5: 48 89 d0  mov%rdx,%rax
>8: 49 89 d2  mov%rdx,%r10
>b: 41 54 push   %r12
>d: 49 89 fc  mov%rdi,%r12
>   10: 48 89 d1  mov%rdx,%rcx
>   13: 49 89 f0  mov%rsi,%r8
>   16: 4c 89 e2  mov%r12,%rdx
>   19: 49 89 f5  mov%rsi,%r13
>   1c: 53push   %rbx
>   1d: 48 89 fe  mov%rdi,%rsi
>   20: 48 c1 fa 3f   sar$0x3f,%rdx
>   24: 48 c1 f8 3f   sar$0x3f,%rax
>   28: 4c 89 df  mov%r11,%rdi
>   2b: 4c 39 c2  cmp%r8,%rdx
>   2e: 75 18 jne48 <__mulvti3+0x48>
>   30: 4c 39 d8  cmp%r11,%rax
>   33: 75 6b jnea0 <__mulvti3+0xa0>
>   35: 4c 89 e0  mov%r12,%rax
>   38: 49 f7 ea  imul   %r10
>   3b: 5bpop%rbx
>   3c: 41 5c pop%r12
>   3e: 41 5d pop%r13
>   40: c3retq
> ...
>
> There are EIGHT superfluous MOV instructions here, clobbering the
> non-volatile registers RBX, R12 and R13, plus THREE superfluous
> PUSH/POP pairs.
>
> What stops GCC from generating the following straightforward code
> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)?
>
> .intel_syntax noprefix
> __mulvti3:
> mov   r8, rdi
> mov   r9, rdx
> sra   r8, 63
> sra   r9, 63
> cmp   r8, rsi
> jne   __mulvti3+0x48+65-31
> cmp   r9, rcx
> jne   __mulvti3+0xa0+65-31
> mov   rax, rdi
> imul  rdx
> ret
> ...
>
>
> not amused

can you open a bugreport please?

Richard.

> Stefan Kanthak


RE: Recognizing loop pattern

2020-10-26 Thread Kyrylo Tkachov via Gcc



> -Original Message-
> From: Gcc  On Behalf Of Stefan Schulze
> Frielinghaus via Gcc
> Sent: 26 October 2020 09:58
> To: gcc@gcc.gnu.org
> Subject: Recognizing loop pattern
> 
> I'm trying to detect loops of the form
> 
>   while (*x != y)
> ++x;
> 
> which mimic the behaviour of function rawmemchr.  Note, the size of *x is
> not
> necessarily one byte.  Thus ultimately I would like to detect such loops and
> replace them with calls to builtins rawmemchr8, rawmemchr16,
> rawmemchr32 if
> they are implemented in the backend:
> 
>   x = __builtin_rawmemchr16(x, y);
> 
> I'm wondering whether there is a particular place in order to detect such
> loop
> patterns.  For example, in the loop distribution pass GCC recognizes loops
> which mimic the behavior of memset, memcpy, memmove and replaces
> them with
> calls to their corresponding builtins, respectively.  The pass and in
> particular partitioning of statements depends on whether a statement is used
> outside of a partition or not.  This works perfectly fine for loops which
> implement the mentioned mem* operations since their result is typically
> ignored.  However, the result of a rawmemchr function is/should never be
> ignored.  Therefore, such loops are currently recognized as having a
> reduction
> which makes an implementation into the loop distribution pass not straight
> forward to me.
> 
> Are there other places where you would detect such loops?  Any comments?

Not an expert on these, but GCC has some similar stuff in 
tree-scalar-evolution.c for detecting popcount loops etc.

Thanks,
Kyrill

> 
> Cheers,
> Stefan


Re: Question about whether a code fragment is expected to parse.

2020-10-26 Thread Nathan Sidwell

On 10/25/20 7:52 AM, Iain Sandoe wrote:

Hi

Given that GNU attributes are not part of the standard..

I wonder if the following is expected to work?

__attribute__((__deprecated__))
extern "C" __attribute__((__visibility__("default")))
void foo ()
{

}

t.C:3:8: error: expected unqualified-id before string constant
     3 | extern "C" __attribute__((__visibility__("default")))



I don't see why it should be will-formed.  'extern "C"' is a linkage 
specification, which precedes (or encloses) a declaration.  It is not 
storage-specified with an attached string constant.


nathan

--
Nathan Sidwell


Re: Question about whether a code fragment is expected to parse.

2020-10-26 Thread Nathan Sidwell

On 10/26/20 7:08 AM, Nathan Sidwell wrote:

On 10/25/20 7:52 AM, Iain Sandoe wrote:

Hi

Given that GNU attributes are not part of the standard..

I wonder if the following is expected to work?

__attribute__((__deprecated__))
extern "C" __attribute__((__visibility__("default")))
void foo ()
{

}

t.C:3:8: error: expected unqualified-id before string constant
 3 | extern "C" __attribute__((__visibility__("default")))



I don't see why it should be will-formed.  'extern "C"' is a linkage


well-formed

specification, which precedes (or encloses) a declaration.  It is not 
storage-specified with an attached string constant.


nathan




--
Nathan Sidwell


Re: Recognizing loop pattern

2020-10-26 Thread Richard Biener via Gcc
On Mon, Oct 26, 2020 at 10:59 AM Stefan Schulze Frielinghaus via Gcc
 wrote:
>
> I'm trying to detect loops of the form
>
>   while (*x != y)
> ++x;
>
> which mimic the behaviour of function rawmemchr.  Note, the size of *x is not
> necessarily one byte.  Thus ultimately I would like to detect such loops and
> replace them with calls to builtins rawmemchr8, rawmemchr16, rawmemchr32 if
> they are implemented in the backend:
>
>   x = __builtin_rawmemchr16(x, y);
>
> I'm wondering whether there is a particular place in order to detect such loop
> patterns.  For example, in the loop distribution pass GCC recognizes loops
> which mimic the behavior of memset, memcpy, memmove and replaces them with
> calls to their corresponding builtins, respectively.  The pass and in
> particular partitioning of statements depends on whether a statement is used
> outside of a partition or not.  This works perfectly fine for loops which
> implement the mentioned mem* operations since their result is typically
> ignored.  However, the result of a rawmemchr function is/should never be
> ignored.  Therefore, such loops are currently recognized as having a reduction
> which makes an implementation into the loop distribution pass not straight
> forward to me.
>
> Are there other places where you would detect such loops?  Any comments?

loop distribution is the correct pass to look at.  You're simply the first to
recognize a reduction pattern.  And yes, you'll likely need some generic
adjustments to the code to handle this.

Richard.

> Cheers,
> Stefan


Re: Question about whether a code fragment is expected to parse.

2020-10-26 Thread Iain Sandoe via Gcc

Nathan Sidwell  wrote:


On 10/26/20 7:08 AM, Nathan Sidwell wrote:

On 10/25/20 7:52 AM, Iain Sandoe wrote:



Given that GNU attributes are not part of the standard..

I wonder if the following is expected to work?

__attribute__((__deprecated__))
extern "C" __attribute__((__visibility__("default")))
void foo ()
{

}

t.C:3:8: error: expected unqualified-id before string constant
 3 | extern "C" __attribute__((__visibility__("default")))

I don't see why it should be will-formed.  'extern "C"' is a linkage


well-formed


OK. That’s fair - but these are GNU attributes, we could choose what
constitutes well-formed-ness (with sufficient justification and thought).

===

I checked with clang and:

__attrs
extern “C” [more attrs]
decl….

Applies _attr  and more attrs to decl… (it does also apply the linkage  
spec, of course)


__attrs
extern “C” {
 decl ..
 decl ..
}

silently [with -Wall Wextra] discards the __attrs (it does not apply them  
to the contained decls).


———

My difficulty is that clang’s acceptance of the [currently ill-formed, in  
GCC] prefix attributes on
a linkage spec has been baked into system headers, which will  break as  
soon as I apply
implemented missing functionality (it’s blocking that), so I need to  
formulate a way forward.


———

Options:

1/ we could make prefix attributes on a linkage spec well-formed IFF they  
applied to all decls covered by the linkage spec (so both cases above would  
get the prefix __attrs applied to all decls covered by the linkage spec).


 * If it is decided that this to remain ill-formed code, then we ought to issue 
a proper diagnostic (not just fall over on the extern decl spec).

 * fixing the clang bug, won’t fix the headers.

2/ I could make this work only for Objective-C (which is kinda partially  
done already), sorta yucky but in a bounded manner.


3/ I could make this work behind an -fclang-compatibility flag
  (there are a number of cases where similar problems exist, and I suspect that 
something along these lines will be needed to make Darwin GCC survive in the 
medium term).

4/ I could apply Yet More fix-includes (which is also yucky).

5/ other suggestions welcome.

===

* I made patches to make this work - those patches would also apply  
reasonably easily to

  (a) issuing a better diagnostic [1b]
  (b) making prefix attrs on linkage specs apply to ObjC++ only [2]
  and, I suppose, it would be equally easy to add a clang compat flag to hide 
them behind. [3]

* ISTM that -std=clang++XY isn’t going to work in the same way as gnu++XY  
does, since the behaviour for clang isn’t particularly gated on any  
standard version (AFAICT).


* In any event, I file a bug against clang because its actions are  
inconsistent,


* I’ve also asked the vendor to consider making future versions of the  
header well-formed (but that doesn’t solve what’s already in the wild).


thoughts?
Iain




Re: [__mulvti3] register allocator plays shell game

2020-10-26 Thread Stefan Kanthak
Richard Biener  wrote:

> On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak  
> wrote:
>>
>> Hi,
>>
>> for the AMD64 alias x86_64 platform and the __int128_t [DW]type,
>> the first few lines of the __mulvDI3() function from libgcc2.c
>>
>> | DWtype
>> | __mulvDI3 (DWtype u, DWtype v)
>> | {
>> |   /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
>> |  but the checked multiplication needs only two.  */
>> |   const DWunion uu = {.ll = u};
>> |   const DWunion vv = {.ll = v};
>> |
>> |   if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
>> | {
>> |   /* u fits in a single Wtype.  */
>> |   if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
>> |  {
>> |/* v fits in a single Wtype as well.  */
>> |/* A single multiplication.  No overflow risk.  */
>> |return (DWtype) uu.s.low * (DWtype) vv.s.low;
>> |  }
>>
>> are compiled to this braindead code (obtained from libgcc.a of
>> GCC 10.2.0 installed on Debian):
>>
>>  <__mulvti3>:
>>0: 41 55 push   %r13
>>2: 49 89 cb  mov%rcx,%r11
>>5: 48 89 d0  mov%rdx,%rax
>>8: 49 89 d2  mov%rdx,%r10
>>b: 41 54 push   %r12
>>d: 49 89 fc  mov%rdi,%r12
>>   10: 48 89 d1  mov%rdx,%rcx
>>   13: 49 89 f0  mov%rsi,%r8
>>   16: 4c 89 e2  mov%r12,%rdx
>>   19: 49 89 f5  mov%rsi,%r13
>>   1c: 53push   %rbx
>>   1d: 48 89 fe  mov%rdi,%rsi
>>   20: 48 c1 fa 3f   sar$0x3f,%rdx
>>   24: 48 c1 f8 3f   sar$0x3f,%rax
>>   28: 4c 89 df  mov%r11,%rdi
>>   2b: 4c 39 c2  cmp%r8,%rdx
>>   2e: 75 18 jne48 <__mulvti3+0x48>
>>   30: 4c 39 d8  cmp%r11,%rax
>>   33: 75 6b jnea0 <__mulvti3+0xa0>
>>   35: 4c 89 e0  mov%r12,%rax
>>   38: 49 f7 ea  imul   %r10
>>   3b: 5bpop%rbx
>>   3c: 41 5c pop%r12
>>   3e: 41 5d pop%r13
>>   40: c3retq
>> ...
>>
>> There are EIGHT superfluous MOV instructions here, clobbering the
>> non-volatile registers RBX, R12 and R13, plus THREE superfluous
>> PUSH/POP pairs.
>>
>> What stops GCC from generating the following straightforward code
>> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)?
>>
>> .intel_syntax noprefix
>> __mulvti3:
>> mov   r8, rdi
>> mov   r9, rdx
>> sra   r8, 63
>> sra   r9, 63
>> cmp   r8, rsi
>> jne   __mulvti3+0x48+65-31
>> cmp   r9, rcx
>> jne   __mulvti3+0xa0+65-31
>> mov   rax, rdi
>> imul  rdx
>> ret
>> ...
>>
>>
>> not amused
> 
> can you open a bugreport please?


I'd like to discuss alternative implementations of __mulvti3() first:
the very first lines of libgcc2.c reads

| /* More subroutines needed by GCC output code on some machines.  */
| /* Compile this one with gcc.  */


Why don't you take advantage of that and implement __mulvDI3() as
follows?

DWtype
__mulvDI3 (DWtype multiplicand, DWtype multiplier)
{
DWtype product;

if (__builtin_mul_overflow(multiplicand, multiplier, &product))
abort();

return product;
}


For the AMD64 platform, instead of the 131 instructions in 439 bytes
(partially cited above) GCC now generates 107 instructions in 324 bytes
 ... (un)fortunately showing this bug again, now clobbering RBP and RBX,
pushing/popping RBP, RBX, R12, R14 and R15, and still moving them around
without necessity:


__mulvti3:
movq%rdi, %r8
movq%rdx, %rdi
pushq   %r15
movq%rcx, %r9
movq%r8, %rdx
movq%rdi, %rax
pushq   %r14
sarq$63, %rdx
pushq   %r12
sarq$63, %rax
pushq   %rbp
xorl%ebp, %ebp
pushq   %rbx
cmpq%rsi, %rdx
jne .L4
cmpq%rcx, %rax
jne .L5
movq%r8, %rax
imulq   %rdi
movq%rax, %rbx
.L2:
testq   %rbp, %rbp
jne __mulvti3.cold
movq%rbx, %rax
popq%rbx
popq%rbp
popq%r12
popq%r14
popq%r15
ret
...
.L4:
cmpq%rcx, %rax
jne .L7
...
jns .L8
xorl%r10d, %r10d
subq%r10, %rax
sbbq%r12, %rdx
.L8:
testq%r12, %r12
jns .L9
...
jne .L10
movq%rcx, %rbx
movq%r10, %rdx
jmp .L2
...
jmp .L6
.L7:
...
ja  .L3
...
ja  .L3
...
jne .L11
...
jl  .L2
.L3:
movl$1, %ebp
jmp .L2
.L11:
testq%r10, %r10
js  .L2
jmp .L3
.L10:
...
jmp .L3
__mulvti3.cold:
.L13:

gsi_remove on a call

2020-10-26 Thread Gary Oblock via Gcc
I'm running into grief in verify_node in cgraph.c
when I use gsi_remove on a call statement.

Specifically it's a free statement which I've replaced
with other free statements as part of my structure
reorg optimizations. Note, in other working code
I do this with malloc and it doesn't seem to be a problem.

Where it happens it's trying to look at the call graph edges.
Is there a way to remove the edge in question or mark it
to be ignored? I see that line below about built in unreachable
and wonder if I'm supposed to set the decl to that but I
don't see others doing it so...

Here's the code in cgraph (e->call_stmt is the free in question:)

 if (gimple_has_body_p (e->caller->decl)
 && !e->caller->inlined_to
 && !e->speculative
 /* Optimized out calls are redirected to __builtin_unreachable.  */
 && (e->count.nonzero_p ()
 || ! e->callee->decl
 || !fndecl_built_in_p (e->callee->decl, BUILT_IN_UNREACHABLE))
 && count == ENTRY_BLOCK_PTR_FOR_FN(DECL_STRUCT_FUNCTION  (decl))->count
&& (!e->count.ipa_p ()
 && e->count.differs_from_p (gimple_bb (e->call_stmt)->count)))
   {
  :

Thanks,

Gary






CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is for 
the sole use of the intended recipient(s) and contains information that is 
confidential and proprietary to Ampere Computing or its subsidiaries. It is to 
be used solely for the purpose of furthering the parties' business 
relationship. Any unauthorized review, copying, or distribution of this email 
(or any attachments thereto) is strictly prohibited. If you are not the 
intended recipient, please contact the sender immediately and permanently 
delete the original and any copies of this email and any attachments thereto.