Recognizing loop pattern
I'm trying to detect loops of the form while (*x != y) ++x; which mimic the behaviour of function rawmemchr. Note, the size of *x is not necessarily one byte. Thus ultimately I would like to detect such loops and replace them with calls to builtins rawmemchr8, rawmemchr16, rawmemchr32 if they are implemented in the backend: x = __builtin_rawmemchr16(x, y); I'm wondering whether there is a particular place in order to detect such loop patterns. For example, in the loop distribution pass GCC recognizes loops which mimic the behavior of memset, memcpy, memmove and replaces them with calls to their corresponding builtins, respectively. The pass and in particular partitioning of statements depends on whether a statement is used outside of a partition or not. This works perfectly fine for loops which implement the mentioned mem* operations since their result is typically ignored. However, the result of a rawmemchr function is/should never be ignored. Therefore, such loops are currently recognized as having a reduction which makes an implementation into the loop distribution pass not straight forward to me. Are there other places where you would detect such loops? Any comments? Cheers, Stefan
Re: [__mulvti3] register allocator plays shell game
On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak wrote: > > Hi, > > for the AMD64 alias x86_64 platform and the __int128_t [DW]type, > the first few lines of the __mulvDI3() function from libgcc2.c > > | DWtype > | __mulvDI3 (DWtype u, DWtype v) > | { > | /* The unchecked multiplication needs 3 Wtype x Wtype multiplications, > | but the checked multiplication needs only two. */ > | const DWunion uu = {.ll = u}; > | const DWunion vv = {.ll = v}; > | > | if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1)) > | { > | /* u fits in a single Wtype. */ > | if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1)) > | { > |/* v fits in a single Wtype as well. */ > |/* A single multiplication. No overflow risk. */ > |return (DWtype) uu.s.low * (DWtype) vv.s.low; > | } > > are compiled to this braindead code (obtained from libgcc.a of > GCC 10.2.0 installed on Debian): > > <__mulvti3>: >0: 41 55 push %r13 >2: 49 89 cb mov%rcx,%r11 >5: 48 89 d0 mov%rdx,%rax >8: 49 89 d2 mov%rdx,%r10 >b: 41 54 push %r12 >d: 49 89 fc mov%rdi,%r12 > 10: 48 89 d1 mov%rdx,%rcx > 13: 49 89 f0 mov%rsi,%r8 > 16: 4c 89 e2 mov%r12,%rdx > 19: 49 89 f5 mov%rsi,%r13 > 1c: 53push %rbx > 1d: 48 89 fe mov%rdi,%rsi > 20: 48 c1 fa 3f sar$0x3f,%rdx > 24: 48 c1 f8 3f sar$0x3f,%rax > 28: 4c 89 df mov%r11,%rdi > 2b: 4c 39 c2 cmp%r8,%rdx > 2e: 75 18 jne48 <__mulvti3+0x48> > 30: 4c 39 d8 cmp%r11,%rax > 33: 75 6b jnea0 <__mulvti3+0xa0> > 35: 4c 89 e0 mov%r12,%rax > 38: 49 f7 ea imul %r10 > 3b: 5bpop%rbx > 3c: 41 5c pop%r12 > 3e: 41 5d pop%r13 > 40: c3retq > ... > > There are EIGHT superfluous MOV instructions here, clobbering the > non-volatile registers RBX, R12 and R13, plus THREE superfluous > PUSH/POP pairs. > > What stops GCC from generating the following straightforward code > (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)? > > .intel_syntax noprefix > __mulvti3: > mov r8, rdi > mov r9, rdx > sra r8, 63 > sra r9, 63 > cmp r8, rsi > jne __mulvti3+0x48+65-31 > cmp r9, rcx > jne __mulvti3+0xa0+65-31 > mov rax, rdi > imul rdx > ret > ... > > > not amused can you open a bugreport please? Richard. > Stefan Kanthak
RE: Recognizing loop pattern
> -Original Message- > From: Gcc On Behalf Of Stefan Schulze > Frielinghaus via Gcc > Sent: 26 October 2020 09:58 > To: gcc@gcc.gnu.org > Subject: Recognizing loop pattern > > I'm trying to detect loops of the form > > while (*x != y) > ++x; > > which mimic the behaviour of function rawmemchr. Note, the size of *x is > not > necessarily one byte. Thus ultimately I would like to detect such loops and > replace them with calls to builtins rawmemchr8, rawmemchr16, > rawmemchr32 if > they are implemented in the backend: > > x = __builtin_rawmemchr16(x, y); > > I'm wondering whether there is a particular place in order to detect such > loop > patterns. For example, in the loop distribution pass GCC recognizes loops > which mimic the behavior of memset, memcpy, memmove and replaces > them with > calls to their corresponding builtins, respectively. The pass and in > particular partitioning of statements depends on whether a statement is used > outside of a partition or not. This works perfectly fine for loops which > implement the mentioned mem* operations since their result is typically > ignored. However, the result of a rawmemchr function is/should never be > ignored. Therefore, such loops are currently recognized as having a > reduction > which makes an implementation into the loop distribution pass not straight > forward to me. > > Are there other places where you would detect such loops? Any comments? Not an expert on these, but GCC has some similar stuff in tree-scalar-evolution.c for detecting popcount loops etc. Thanks, Kyrill > > Cheers, > Stefan
Re: Question about whether a code fragment is expected to parse.
On 10/25/20 7:52 AM, Iain Sandoe wrote: Hi Given that GNU attributes are not part of the standard.. I wonder if the following is expected to work? __attribute__((__deprecated__)) extern "C" __attribute__((__visibility__("default"))) void foo () { } t.C:3:8: error: expected unqualified-id before string constant 3 | extern "C" __attribute__((__visibility__("default"))) I don't see why it should be will-formed. 'extern "C"' is a linkage specification, which precedes (or encloses) a declaration. It is not storage-specified with an attached string constant. nathan -- Nathan Sidwell
Re: Question about whether a code fragment is expected to parse.
On 10/26/20 7:08 AM, Nathan Sidwell wrote: On 10/25/20 7:52 AM, Iain Sandoe wrote: Hi Given that GNU attributes are not part of the standard.. I wonder if the following is expected to work? __attribute__((__deprecated__)) extern "C" __attribute__((__visibility__("default"))) void foo () { } t.C:3:8: error: expected unqualified-id before string constant 3 | extern "C" __attribute__((__visibility__("default"))) I don't see why it should be will-formed. 'extern "C"' is a linkage well-formed specification, which precedes (or encloses) a declaration. It is not storage-specified with an attached string constant. nathan -- Nathan Sidwell
Re: Recognizing loop pattern
On Mon, Oct 26, 2020 at 10:59 AM Stefan Schulze Frielinghaus via Gcc wrote: > > I'm trying to detect loops of the form > > while (*x != y) > ++x; > > which mimic the behaviour of function rawmemchr. Note, the size of *x is not > necessarily one byte. Thus ultimately I would like to detect such loops and > replace them with calls to builtins rawmemchr8, rawmemchr16, rawmemchr32 if > they are implemented in the backend: > > x = __builtin_rawmemchr16(x, y); > > I'm wondering whether there is a particular place in order to detect such loop > patterns. For example, in the loop distribution pass GCC recognizes loops > which mimic the behavior of memset, memcpy, memmove and replaces them with > calls to their corresponding builtins, respectively. The pass and in > particular partitioning of statements depends on whether a statement is used > outside of a partition or not. This works perfectly fine for loops which > implement the mentioned mem* operations since their result is typically > ignored. However, the result of a rawmemchr function is/should never be > ignored. Therefore, such loops are currently recognized as having a reduction > which makes an implementation into the loop distribution pass not straight > forward to me. > > Are there other places where you would detect such loops? Any comments? loop distribution is the correct pass to look at. You're simply the first to recognize a reduction pattern. And yes, you'll likely need some generic adjustments to the code to handle this. Richard. > Cheers, > Stefan
Re: Question about whether a code fragment is expected to parse.
Nathan Sidwell wrote: On 10/26/20 7:08 AM, Nathan Sidwell wrote: On 10/25/20 7:52 AM, Iain Sandoe wrote: Given that GNU attributes are not part of the standard.. I wonder if the following is expected to work? __attribute__((__deprecated__)) extern "C" __attribute__((__visibility__("default"))) void foo () { } t.C:3:8: error: expected unqualified-id before string constant 3 | extern "C" __attribute__((__visibility__("default"))) I don't see why it should be will-formed. 'extern "C"' is a linkage well-formed OK. That’s fair - but these are GNU attributes, we could choose what constitutes well-formed-ness (with sufficient justification and thought). === I checked with clang and: __attrs extern “C” [more attrs] decl…. Applies _attr and more attrs to decl… (it does also apply the linkage spec, of course) __attrs extern “C” { decl .. decl .. } silently [with -Wall Wextra] discards the __attrs (it does not apply them to the contained decls). ——— My difficulty is that clang’s acceptance of the [currently ill-formed, in GCC] prefix attributes on a linkage spec has been baked into system headers, which will break as soon as I apply implemented missing functionality (it’s blocking that), so I need to formulate a way forward. ——— Options: 1/ we could make prefix attributes on a linkage spec well-formed IFF they applied to all decls covered by the linkage spec (so both cases above would get the prefix __attrs applied to all decls covered by the linkage spec). * If it is decided that this to remain ill-formed code, then we ought to issue a proper diagnostic (not just fall over on the extern decl spec). * fixing the clang bug, won’t fix the headers. 2/ I could make this work only for Objective-C (which is kinda partially done already), sorta yucky but in a bounded manner. 3/ I could make this work behind an -fclang-compatibility flag (there are a number of cases where similar problems exist, and I suspect that something along these lines will be needed to make Darwin GCC survive in the medium term). 4/ I could apply Yet More fix-includes (which is also yucky). 5/ other suggestions welcome. === * I made patches to make this work - those patches would also apply reasonably easily to (a) issuing a better diagnostic [1b] (b) making prefix attrs on linkage specs apply to ObjC++ only [2] and, I suppose, it would be equally easy to add a clang compat flag to hide them behind. [3] * ISTM that -std=clang++XY isn’t going to work in the same way as gnu++XY does, since the behaviour for clang isn’t particularly gated on any standard version (AFAICT). * In any event, I file a bug against clang because its actions are inconsistent, * I’ve also asked the vendor to consider making future versions of the header well-formed (but that doesn’t solve what’s already in the wild). thoughts? Iain
Re: [__mulvti3] register allocator plays shell game
Richard Biener wrote: > On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak > wrote: >> >> Hi, >> >> for the AMD64 alias x86_64 platform and the __int128_t [DW]type, >> the first few lines of the __mulvDI3() function from libgcc2.c >> >> | DWtype >> | __mulvDI3 (DWtype u, DWtype v) >> | { >> | /* The unchecked multiplication needs 3 Wtype x Wtype multiplications, >> | but the checked multiplication needs only two. */ >> | const DWunion uu = {.ll = u}; >> | const DWunion vv = {.ll = v}; >> | >> | if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1)) >> | { >> | /* u fits in a single Wtype. */ >> | if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1)) >> | { >> |/* v fits in a single Wtype as well. */ >> |/* A single multiplication. No overflow risk. */ >> |return (DWtype) uu.s.low * (DWtype) vv.s.low; >> | } >> >> are compiled to this braindead code (obtained from libgcc.a of >> GCC 10.2.0 installed on Debian): >> >> <__mulvti3>: >>0: 41 55 push %r13 >>2: 49 89 cb mov%rcx,%r11 >>5: 48 89 d0 mov%rdx,%rax >>8: 49 89 d2 mov%rdx,%r10 >>b: 41 54 push %r12 >>d: 49 89 fc mov%rdi,%r12 >> 10: 48 89 d1 mov%rdx,%rcx >> 13: 49 89 f0 mov%rsi,%r8 >> 16: 4c 89 e2 mov%r12,%rdx >> 19: 49 89 f5 mov%rsi,%r13 >> 1c: 53push %rbx >> 1d: 48 89 fe mov%rdi,%rsi >> 20: 48 c1 fa 3f sar$0x3f,%rdx >> 24: 48 c1 f8 3f sar$0x3f,%rax >> 28: 4c 89 df mov%r11,%rdi >> 2b: 4c 39 c2 cmp%r8,%rdx >> 2e: 75 18 jne48 <__mulvti3+0x48> >> 30: 4c 39 d8 cmp%r11,%rax >> 33: 75 6b jnea0 <__mulvti3+0xa0> >> 35: 4c 89 e0 mov%r12,%rax >> 38: 49 f7 ea imul %r10 >> 3b: 5bpop%rbx >> 3c: 41 5c pop%r12 >> 3e: 41 5d pop%r13 >> 40: c3retq >> ... >> >> There are EIGHT superfluous MOV instructions here, clobbering the >> non-volatile registers RBX, R12 and R13, plus THREE superfluous >> PUSH/POP pairs. >> >> What stops GCC from generating the following straightforward code >> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)? >> >> .intel_syntax noprefix >> __mulvti3: >> mov r8, rdi >> mov r9, rdx >> sra r8, 63 >> sra r9, 63 >> cmp r8, rsi >> jne __mulvti3+0x48+65-31 >> cmp r9, rcx >> jne __mulvti3+0xa0+65-31 >> mov rax, rdi >> imul rdx >> ret >> ... >> >> >> not amused > > can you open a bugreport please? I'd like to discuss alternative implementations of __mulvti3() first: the very first lines of libgcc2.c reads | /* More subroutines needed by GCC output code on some machines. */ | /* Compile this one with gcc. */ Why don't you take advantage of that and implement __mulvDI3() as follows? DWtype __mulvDI3 (DWtype multiplicand, DWtype multiplier) { DWtype product; if (__builtin_mul_overflow(multiplicand, multiplier, &product)) abort(); return product; } For the AMD64 platform, instead of the 131 instructions in 439 bytes (partially cited above) GCC now generates 107 instructions in 324 bytes ... (un)fortunately showing this bug again, now clobbering RBP and RBX, pushing/popping RBP, RBX, R12, R14 and R15, and still moving them around without necessity: __mulvti3: movq%rdi, %r8 movq%rdx, %rdi pushq %r15 movq%rcx, %r9 movq%r8, %rdx movq%rdi, %rax pushq %r14 sarq$63, %rdx pushq %r12 sarq$63, %rax pushq %rbp xorl%ebp, %ebp pushq %rbx cmpq%rsi, %rdx jne .L4 cmpq%rcx, %rax jne .L5 movq%r8, %rax imulq %rdi movq%rax, %rbx .L2: testq %rbp, %rbp jne __mulvti3.cold movq%rbx, %rax popq%rbx popq%rbp popq%r12 popq%r14 popq%r15 ret ... .L4: cmpq%rcx, %rax jne .L7 ... jns .L8 xorl%r10d, %r10d subq%r10, %rax sbbq%r12, %rdx .L8: testq%r12, %r12 jns .L9 ... jne .L10 movq%rcx, %rbx movq%r10, %rdx jmp .L2 ... jmp .L6 .L7: ... ja .L3 ... ja .L3 ... jne .L11 ... jl .L2 .L3: movl$1, %ebp jmp .L2 .L11: testq%r10, %r10 js .L2 jmp .L3 .L10: ... jmp .L3 __mulvti3.cold: .L13:
gsi_remove on a call
I'm running into grief in verify_node in cgraph.c when I use gsi_remove on a call statement. Specifically it's a free statement which I've replaced with other free statements as part of my structure reorg optimizations. Note, in other working code I do this with malloc and it doesn't seem to be a problem. Where it happens it's trying to look at the call graph edges. Is there a way to remove the edge in question or mark it to be ignored? I see that line below about built in unreachable and wonder if I'm supposed to set the decl to that but I don't see others doing it so... Here's the code in cgraph (e->call_stmt is the free in question:) if (gimple_has_body_p (e->caller->decl) && !e->caller->inlined_to && !e->speculative /* Optimized out calls are redirected to __builtin_unreachable. */ && (e->count.nonzero_p () || ! e->callee->decl || !fndecl_built_in_p (e->callee->decl, BUILT_IN_UNREACHABLE)) && count == ENTRY_BLOCK_PTR_FOR_FN(DECL_STRUCT_FUNCTION (decl))->count && (!e->count.ipa_p () && e->count.differs_from_p (gimple_bb (e->call_stmt)->count))) { : Thanks, Gary CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and contains information that is confidential and proprietary to Ampere Computing or its subsidiaries. It is to be used solely for the purpose of furthering the parties' business relationship. Any unauthorized review, copying, or distribution of this email (or any attachments thereto) is strictly prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete the original and any copies of this email and any attachments thereto.