[Bug tree-optimization/56456] [meta-bug] bogus warning when inlining or unrolling: "array subscript is above array bounds"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56456

Mason changed:

           What    |Removed                     |Added
                 CC|                            |jwakely.gcc at gmail dot com,
                   |                            |law at redhat dot com,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |slash.tmp at free dot fr

--- Comment #2 from Mason ---
A few more bugs should be added to this tracker:
(It seems I don't have permission to do that?)

bug 59124
bug 63441
bug 63477
bug 80907
bug 82286

Adding my own testcase here:

extern int array[3];

void foo(int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            if (array[i] == array[j])
                array[j] = 0;
}

gcc -Wall -O3 test.c triggers bogus warning(s) with any version of gcc >= 4.8
[Bug tree-optimization/56456] [meta-bug] bogus warning when inlining or unrolling: "array subscript is above array bounds"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56456

--- Comment #5 from Mason ---
Slightly smaller testcase, similar to bug 80907.

extern int M[16];

void foo(int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            M[i+j] = 0;
}

$ gcc-7 -O3 -Wall -S testcase4.c
testcase4.c: In function 'foo':
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
     M[i+j] = 0;
     ~^
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]

Same result with trunk.

Using -fopt-info adds:
testcase4.c:5:3: note: loop turned into non-loop; it never loops.
testcase4.c:5:3: note: loop with 17 iterations completely unrolled
[Bug tree-optimization/65461] -Warray-bounds warnings in the linux kernel (free_area_init_nodes)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65461

Mason changed:

           What    |Removed                     |Added
                 CC|                            |msebor at gcc dot gnu.org,
                   |                            |slash.tmp at free dot fr

--- Comment #3 from Mason ---
Here is a reduced test case:

extern void foo(int *p);
extern int array[2];

void func(void)
{
    int i;
    for (i = 1; i < 2; i++) {
        if (i == 1)
            continue;
        array[i-1] = 0;
    }
    foo(&i);
}

$ gcc-7 -O3 -Wall -S testcase5.c
testcase5.c: In function 'func':
testcase5.c:9:14: warning: array subscript is below array bounds [-Warray-bounds]
   array[i-1] = 0;
   ~^

The loop is obviously a NOP (other than setting i to 2): it starts at 1, skips
that index, moves to 2, and exits. GCC does figure this out, but only after
issuing a spurious warning.

func:
        subq    $24, %rsp       # allocate space on the stack
        leaq    12(%rsp), %rdi  # copy &i to rdi
        movl    $2, 12(%rsp)    # i = 2
        call    foo             # foo(&i)
        addq    $24, %rsp       # clean up stack
        ret
[Bug middle-end/66031] Spurious array bounds warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66031

Mason changed:

           What    |Removed                     |Added
                 CC|                            |slash.tmp at free dot fr

--- Comment #2 from Mason ---
FWIW, the following code looks somewhat fishy to me:

unsigned char f = 10;
extern unsigned char p[1];

char check(unsigned int n)
{
    if (n - 1 >= f)
        return p[n];
    return 1;
}

p[n] is well-defined iff n == 0.
If n == 0, then n - 1 == UINT_MAX, so the test is true for any f.
No other value of n is allowed as an index into p.
Thus I would expect gcc to turn the code into

    return (n == 0) ? p[0] : 1;
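A minimal standalone sketch of the wraparound argument above; the
check_simplified() helper and the local definition of p are hypothetical,
added only to illustrate the transformation the comment expects:

#include <assert.h>
#include <limits.h>

unsigned char f = 10;
unsigned char p[1] = { 42 };    /* defined here so the sketch links on its own */

/* Hypothetical hand-simplified form the comment expects gcc to reach. */
char check_simplified(unsigned int n)
{
    return (n == 0) ? p[0] : 1;
}

char check(unsigned int n)
{
    if (n - 1 >= f)
        return p[n];
    return 1;
}

int main(void)
{
    unsigned int n = 0;
    assert(n - 1 == UINT_MAX);                  /* unsigned wraparound */
    assert(check(0) == check_simplified(0));    /* both read p[0] */
    assert(check(1) == check_simplified(1));    /* both return 1 */
    return 0;
}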
[Bug rtl-optimization/83272] New: Unnecessary mask instruction generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83272

            Bug ID: 83272
           Summary: Unnecessary mask instruction generated
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: slash.tmp at free dot fr
  Target Milestone: ---

Consider the following testcase:

char foo(unsigned char n)
{
    static const char map[16] = "wxyz";
    return map[n / 16];
}

gcc-7 -O2 -march=haswell -S testcase.c generates:

foo:
        shrb    $4, %dil
        andl    $15, %edi
        movzbl  map.2295(%rdi), %eax
        ret

On this platform, CHAR_BIT = 8 and UCHAR_MAX = 255, therefore n / 16 is
guaranteed to be less than 16. Yet, GCC generates an unnecessary mask
instruction (andl $15, %edi).

https://gcc.gnu.org/ml/gcc-help/2017-11/msg00102.html
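A small self-contained sketch of the range argument above, assuming
CHAR_BIT == 8 as stated; the exhaustive assert-based check is only
illustrative:

#include <assert.h>
#include <limits.h>

int main(void)
{
    /* With an 8-bit unsigned char, n / 16 is at most 255 / 16 == 15,
       so every possible index stays inside map[16] and the mask is redundant. */
    assert(CHAR_BIT == 8);
    for (unsigned int n = 0; n <= UCHAR_MAX; n++)
        assert(n / 16 < 16);
    return 0;
}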
[Bug rtl-optimization/83272] Unnecessary mask instruction generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83272

--- Comment #2 from Mason ---
(In reply to Jakub Jelinek from comment #1)
> I don't believe the andl is not needed after shrb, as that is an 8-bit
> operand size, it should leave the upper 56 bits of the register unmodified.
> And unsigned char argument is in the ABI passed as int, so I think the upper
> 32 bits are undefined, which the andl instruction clears.

I checked the amd64 SysV ABI, and didn't see a requirement for a function
returning an INTEGER type smaller than 64 bits to clear the upper bits.
(The caller knows which bits are valid.)

Anyway, I may have oversimplified the testcase. Consider this one:

char foo(unsigned char *p)
{
    static const char map[16] = "wxyz";
    return map[*p / 16];
}

foo:
        movzbl  (%rdi), %eax
        shrb    $4, %al
        andl    $15, %eax
        movzbl  map.2295(%rax), %eax
        ret

movzbl writes all 64 bits of RAX (zero-extending the loaded byte).
shrb then discards 4 of the 8 low bits, leaving only 4.
Thus, andl is a no-op in this case.
[Bug rtl-optimization/83272] Unnecessary mask instruction generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83272

--- Comment #3 from Mason ---
I think Jakub is right about an interaction between movzbl and shrb.

unsigned long long foo1(unsigned char *p)
{
    return *p;
}

foo1:
        movzbl  (%rdi), %eax
        ret

I.e. gcc "knows" that movzbl clears the upper bits of RAX. However...

unsigned long long foo2(unsigned char *p)
{
    return *p / 2;
}

foo2:
        movzbl  (%rdi), %eax
        shrb    %al
        movzbl  %al, %eax
        ret

gcc seems to think that shrb might "mess up" some bits in RAX, which then need
to be cleared again through an extra movzbl. "shr %al" is encoded as "d0 e8",
while "shr %eax" is "d1 e8", so the former is not more efficient than the
latter.

As Marc Glisse pointed out, a solution is convincing gcc to store the
intermediate result in a register:

unsigned long long foo3(unsigned char *p)
{
    unsigned int temp = *p;
    return temp / 2;
}

foo3:
        movzbl  (%rdi), %eax
        shrl    %eax
        ret

gcc "knows" that shrl does not "disturb" RAX, so no extra movzbl is required.

According to the "integer promotions" rules, *p is promoted to int in the
expression (*p / 2) used in foo2. However...

unsigned long long foo4(unsigned char *p)
{
    int temp = *p;
    return temp / 2;
}

foo4:
        movzbl  (%rdi), %eax
        sarl    %eax
        cltq
        ret

An 8-bit unsigned char promoted to int or unsigned int has the same value, so
I would expect the same code to be generated for foo2, foo3, and foo4. Yet we
get three different code snippets.
[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #18 from Mason ---
Hello Michael_S,

As far as I can see, massaging the source helps GCC generate optimal code
(in terms of instruction count; not convinced about scheduling).

#include <immintrin.h>

typedef unsigned long long u64;

void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
    unsigned char c = 0;
    c = _addcarry_u64(c, A[0], B[0], dst+0);
    c = _addcarry_u64(c, A[1], B[1], dst+1);
    c = _addcarry_u64(c, A[2], B[2], dst+2);
    c = _addcarry_u64(c, A[3], B[3], dst+3);
}

On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate the
expected:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret

I'll run a few benchmarks to test optimal scheduling.
[Bug target/110104] New: gcc produces sub-optimal code for _addcarry_u64 chain
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104

            Bug ID: 110104
           Summary: gcc produces sub-optimal code for _addcarry_u64 chain
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: slash.tmp at free dot fr
  Target Milestone: ---

Consider the following code:

#include <immintrin.h>

typedef unsigned long long u64;
typedef unsigned __int128 u128;

void testcase1(u64 *acc, u64 a, u64 b)
{
    u128 res = (u128)a * b;
    u64 lo = res, hi = res >> 64;
    unsigned char cf = 0;
    cf = _addcarry_u64(cf, lo, acc[0], acc+0);
    cf = _addcarry_u64(cf, hi, acc[1], acc+1);
    cf = _addcarry_u64(cf, 0, acc[2], acc+2);
}

void testcase2(u64 *acc, u64 a, u64 b)
{
    u128 res = (u128)a * b;
    u64 lo = res, hi = res >> 64;
    asm("add %[LO], %[D0]\n\t"
        "adc %[HI], %[D1]\n\t"
        "adc $0, %[D2]"
        : [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2])
        : [LO] "r" (lo), [HI] "r" (hi)
        : "cc");
}

Compilation with either
  gcc-trunk -Wall -Wextra -O3 -S testcase.c
  gcc-trunk -Wall -Wextra -Os -S testcase.c
generates the same code:

// rdi = acc, rsi = a, rdx = b

testcase1:
        movq    %rsi, %rax
        mulq    %rdx
        addq    %rax, (%rdi)
        movq    %rdx, %rax
        adcq    8(%rdi), %rax
        adcq    $0, 16(%rdi)
        movq    %rax, 8(%rdi)
        ret

testcase2:
        movq    %rsi, %rax      ; rax = rsi = a
        mulq    %rdx            ; rdx:rax = rax*rdx = a*b
        add     %rax, (%rdi)    ; acc[0] += lo
        adc     %rdx, 8(%rdi)   ; acc[1] += hi + cf
        adc     $0, 16(%rdi)    ; acc[2] += cf
        ret

Conclusion: gcc generates the expected code for testcase2. However, the code
for testcase1 is sub-optimal:

        movq    %rdx, %rax
        adcq    8(%rdi), %rax
        movq    %rax, 8(%rdi)

instead of

        adc     %rdx, 8(%rdi)   ; acc[1] += hi + cf

The copy of rdx to rax is useless, and the (load/add + store) ops can be
merged into a single load/add/store op.
[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974

--- Comment #11 from Mason ---
Here's umul_least_64() rewritten as mul_64x64_128() in C:

typedef unsigned int u32;
typedef unsigned long long u64;

/* u32 acc[3], a[1], b[1] */
static void mul_add_32x32(u32 *acc, const u32 *a, const u32 *b)
{
    u64 res = (u64)a[0] * b[0];
    u32 lo = res, hi = res >> 32;
    asm("add %[LO], %[D0]\n\t"
        "adc %[HI], %[D1]\n\t"
        "adc $0, %[D2]"
        : [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2])
        : [LO] "r" (lo), [HI] "r" (hi)
        : "cc");
}

/* u32 acc[5], a[2], b[2] */
void mul_64x64_128(u32 *acc, const u32 *a, const u32 *b)
{
    mul_add_32x32(acc+0, a+0, b+0);
    mul_add_32x32(acc+1, a+0, b+1);
    mul_add_32x32(acc+1, a+1, b+0);
    mul_add_32x32(acc+2, a+1, b+1);
}

gcc-trunk -O3 -m32

mul_64x64_128:
        pushl   %esi
        pushl   %ebx
        movl    16(%esp), %ebx  ; ebx = a
        movl    20(%esp), %esi  ; esi = b
        movl    12(%esp), %ecx  ; ecx = acc
        movl    (%esi), %eax    ; b0
        mull    (%ebx)          ; a0*b0
        add     %eax, (%ecx)
        adc     %edx, 4(%ecx)
        adc     $0, 8(%ecx)
        movl    4(%esi), %eax   ; b1
        mull    (%ebx)          ; a0*b1
        add     %eax, 4(%ecx)
        adc     %edx, 8(%ecx)
        adc     $0, 12(%ecx)
        movl    (%esi), %eax    ; b0
        mull    4(%ebx)         ; a1*b0
        add     %eax, 4(%ecx)
        adc     %edx, 8(%ecx)
        adc     $0, 12(%ecx)
        movl    4(%esi), %eax   ; b1
        mull    4(%ebx)         ; a1*b1
        add     %eax, 8(%ecx)
        adc     %edx, 12(%ecx)
        adc     $0, 16(%ecx)
        popl    %ebx
        popl    %esi
        ret
[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974

--- Comment #12 from Mason ---
Actually, in this case, we don't need to propagate the carry over 3 limbs.

typedef unsigned int u32;
typedef unsigned long long u64;

/* u32 acc[2], a[1], b[1] */
static void mul_add_32x32(u32 *acc, const u32 *a, const u32 *b)
{
    u64 res = (u64)a[0] * b[0];
    u32 lo = res, hi = res >> 32;
    asm("add %[LO], %[D0]\n\t"
        "adc %[HI], %[D1]"
        : [D0] "+m" (acc[0]), [D1] "+m" (acc[1])
        : [LO] "r" (lo), [HI] "r" (hi)
        : "cc");
}

/* u32 acc[4], a[2], b[2] */
void mul_64x64_128(u32 *acc, const u32 *a, const u32 *b)
{
    mul_add_32x32(acc+0, a+0, b+0);
    mul_add_32x32(acc+1, a+0, b+1);
    mul_add_32x32(acc+1, a+1, b+0);
    mul_add_32x32(acc+2, a+1, b+1);
}

gcc-trunk -O3 -m32

mul_64x64_128:
        pushl   %esi
        pushl   %ebx
        movl    16(%esp), %ebx
        movl    20(%esp), %esi
        movl    12(%esp), %ecx
        movl    (%esi), %eax
        mull    (%ebx)
        add     %eax, (%ecx)
        adc     %edx, 4(%ecx)
        movl    4(%esi), %eax
        mull    (%ebx)
        add     %eax, 4(%ecx)
        adc     %edx, 8(%ecx)
        movl    (%esi), %eax
        mull    4(%ebx)
        add     %eax, 4(%ecx)
        adc     %edx, 8(%ecx)
        movl    4(%esi), %eax
        mull    4(%ebx)
        add     %eax, 8(%ecx)
        adc     %edx, 12(%ecx)
        popl    %ebx
        popl    %esi
        ret
[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974

--- Comment #16 from Mason ---
For the record, the example I provided was intended to show that, with some
help, GCC can generate good code for bigint multiplication. In this situation,
"help" means a short asm template.
[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #20 from Mason ---
Doh! You're right. I come from a background where overlapping/aliasing inputs
are heresy, thus got blindsided :(

This would be the optimal code, right?

add4i:  # rdi = dst, rsi = a, rdx = b
        movq    0(%rdx), %r8
        movq    8(%rdx), %rax
        movq    16(%rdx), %rcx
        movq    24(%rdx), %rdx
        addq    0(%rsi), %r8
        adcq    8(%rsi), %rax
        adcq    16(%rsi), %rcx
        adcq    24(%rsi), %rdx
        movq    %r8, 0(%rdi)
        movq    %rax, 8(%rdi)
        movq    %rcx, 16(%rdi)
        movq    %rdx, 24(%rdi)
        ret

FWIW, trunk generates even nastier code for znver2:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %r8
        adcq    16(%rdx), %r8
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        vmovq   %rax, %xmm2
        vpinsrq $1, %rcx, %xmm2, %xmm0
        vmovq   %r8, %xmm1
        vpinsrq $1, %rdx, %xmm1, %xmm1
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi)
        vzeroupper
        ret
[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104

--- Comment #2 from Mason ---
You meant PR79173 ;)

Latest update:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621554.html

I didn't see my testcase specifically in Jakub's patch, but I'll test trunk on
godbolt when/if the patch lands.
[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104

--- Comment #4 from Mason ---
I confirm that trunk now emits the same code for testcase1 and testcase2.
Thanks Jakub and Roger, great work!
[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104

--- Comment #5 from Mason ---
FWIW, trunk (gcc14) translates testcase3 to the same code as the other
testcases, while remaining portable across all architectures:

$ gcc-trunk -O3 -march=bdver3 testcase3.c

typedef unsigned long long u64;
typedef unsigned __int128 u128;

void testcase3(u64 *acc, u64 a, u64 b)
{
    int c1, c2;
    u128 res = (u128)a * b;
    u64 lo = res, hi = res >> 64;
    c1 = __builtin_add_overflow(lo, acc[0], &acc[0]);
    c2 = __builtin_add_overflow(hi, acc[1], &acc[1])
       | __builtin_add_overflow(c1, acc[1], &acc[1]);
    __builtin_add_overflow(c2, acc[2], &acc[2]);
}

testcase3:
        movq    %rsi, %rax
        mulq    %rdx
        addq    %rax, (%rdi)
        adcq    %rdx, 8(%rdi)
        adcq    $0, 16(%rdi)
        ret

Thanks again, Jakub.