[Bug tree-optimization/56456] [meta-bug] bogus warning when inlining or unrolling: "array subscript is above array bounds"

2017-10-12 Thread slash.tmp at free dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56456

Mason  changed:

   What|Removed |Added

 CC||jwakely.gcc at gmail dot com,
   ||law at redhat dot com,
   ||rguenth at gcc dot gnu.org,
   ||slash.tmp at free dot fr

--- Comment #2 from Mason  ---
A few more bugs should be added to this tracker:
(It seems I don't have permission to do that?)

bug 59124
bug 63441
bug 63477
bug 80907
bug 82286

Adding my own testcase here:

extern int array[3];
void foo(int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            if (array[i] == array[j])
                array[j] = 0;
}

gcc -Wall -O3 test.c triggers bogus warning(s) with any version of gcc >= 4.8

[Bug tree-optimization/56456] [meta-bug] bogus warning when inlining or unrolling: "array subscript is above array bounds"

2017-10-18 Thread slash.tmp at free dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56456

--- Comment #5 from Mason  ---
Slightly smaller testcase, similar to bug 80907.

extern int M[16];
void foo(int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            M[i+j] = 0;
}

$ gcc-7 -O3 -Wall -S testcase4.c 
testcase4.c: In function 'foo':
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
   M[i+j] = 0;
   ~^
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]
testcase4.c:6:5: warning: array subscript is above array bounds [-Warray-bounds]

Same result with trunk.

Using -fopt-info adds:
testcase4.c:5:3: note: loop turned into non-loop; it never loops.
testcase4.c:5:3: note: loop with 17 iterations completely unrolled

[Bug tree-optimization/65461] -Warray-bounds warnings in the linux kernel (free_area_init_nodes)

2017-10-18 Thread slash.tmp at free dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65461

Mason  changed:

   What|Removed |Added

 CC||msebor at gcc dot gnu.org,
   ||slash.tmp at free dot fr

--- Comment #3 from Mason  ---
Here is a reduced test case:

extern void foo(int *p);
extern int array[2];
void func(void)
{
    int i;

    for (i = 1; i < 2; i++) {
        if (i == 1) continue;
        array[i-1] = 0;
    }

    foo(&i);
}

$ gcc-7 -O3 -Wall -S testcase5.c 
testcase5.c: In function 'func':
testcase5.c:9:14: warning: array subscript is below array bounds [-Warray-bounds]
   array[i-1] = 0;
   ~^

The loop is obviously a no-op, apart from setting i to 2: it starts at
i = 1, skips that index via the continue, increments to 2, and exits.
GCC figures this out, but only after issuing a spurious warning.
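In source terms, the whole function reduces to the following (my sketch of
the end result, matching the assembly below):

extern void foo(int *p);
void func(void)
{
    int i = 2;  /* loop body never executes; only i's final value survives */
    foo(&i);
}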

func:
    subq    $24, %rsp       # allocate space on the stack
    leaq    12(%rsp), %rdi  # copy &i to rdi
    movl    $2, 12(%rsp)    # i = 2
    call    foo             # foo(&i)
    addq    $24, %rsp       # clean up stack
    ret

[Bug middle-end/66031] Spurious array bounds warning

2017-10-18 Thread slash.tmp at free dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66031

Mason  changed:

   What|Removed |Added

 CC||slash.tmp at free dot fr

--- Comment #2 from Mason  ---
FWIW, the following code looks somewhat fishy to me:

unsigned char f = 10;
extern unsigned char p[1];
char check(unsigned int n)
{
   if (n - 1 >= f) return p[n];
   return 1;
}

p[n] is well-defined iff n == 0.

If n == 0, then n - 1 == UINT_MAX, so the test is true for any f.
No other value of n is allowed as an index into p.
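A quick standalone check of the wraparound claim (a hypothetical driver,
not part of the original report):

#include <assert.h>
#include <limits.h>

int main(void)
{
    unsigned int n = 0;
    /* Unsigned arithmetic wraps: 0u - 1 == UINT_MAX, which is >= any
       value the unsigned char f can hold (f <= 255 < UINT_MAX). */
    assert(n - 1 == UINT_MAX);
    assert(n - 1 >= 255);
    return 0;
}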

Thus I would expect gcc to turn the code into

  return (n == 0) ? p[0] : 1;

[Bug rtl-optimization/83272] New: Unnecessary mask instruction generated

2017-12-04 Thread slash.tmp at free dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83272

Bug ID: 83272
   Summary: Unnecessary mask instruction generated
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: slash.tmp at free dot fr
  Target Milestone: ---

Consider the following testcase:

char foo(unsigned char n)
{
    static const char map[16] = "wxyz";
    return map[n / 16];
}

gcc-7 -O2 -march=haswell -S testcase.c generates:

foo:
    shrb    $4, %dil
    andl    $15, %edi
    movzbl  map.2295(%rdi), %eax
    ret


On this platform, CHAR_BIT == 8 and UCHAR_MAX == 255, so n <= 255 and
n / 16 <= 15, always a valid index into map.
Yet GCC generates an unnecessary mask instruction (andl $15, %edi).
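The bound can even be checked at compile time (a hypothetical one-liner,
not from the report):

#include <limits.h>

/* The largest value an unsigned char can hold, divided by 16,
   still indexes within map[16]. */
_Static_assert(UCHAR_MAX / 16 < 16, "n / 16 is always in bounds");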

https://gcc.gnu.org/ml/gcc-help/2017-11/msg00102.html

[Bug rtl-optimization/83272] Unnecessary mask instruction generated

2017-12-04 Thread slash.tmp at free dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83272

--- Comment #2 from Mason  ---
(In reply to Jakub Jelinek from comment #1)

> I don't believe the andl is not needed after shrb, as that is an 8-bit
> operand size, it should leave the upper 56 bits of the register unmodified. 
> And unsigned char argument is in the ABI passed as int, so I think the upper
> 32 bits are undefined, which the andl instruction clears.

I checked the amd64 SysV ABI, and didn't see a requirement for a function
returning an INTEGER type smaller than 64 bits to clear the upper bits?
(The caller knows what bits are valid.)

Anyway, I may have oversimplified the testcase. Consider this one:

char foo(unsigned char *p)
{
    static const char map[16] = "wxyz";
    return map[*p / 16];
}

foo:
    movzbl  (%rdi), %eax
    shrb    $4, %al
    andl    $15, %eax
    movzbl  map.2295(%rax), %eax
    ret


movzbl writes all 32 bits of EAX and zero-extends into RAX, so the upper
56 bits are known to be zero. shrb then discards 4 of the low 8 bits,
leaving a value in [0, 15]. Thus, the andl is a no-op in that case.

[Bug rtl-optimization/83272] Unnecessary mask instruction generated

2017-12-05 Thread slash.tmp at free dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83272

--- Comment #3 from Mason  ---
I think Jakub is right about an interaction between movzbl and shrb.

unsigned long long foo1(unsigned char *p) { return *p; }

foo1:
    movzbl  (%rdi), %eax
    ret

I.e. gcc "knows" that movzbl clears the upper bits of RAX. However...

unsigned long long foo2(unsigned char *p) { return *p / 2; }

foo2:
    movzbl  (%rdi), %eax
    shrb    %al
    movzbl  %al, %eax
    ret

gcc seems to think that shrb might "mess up" some bits in RAX, which then
need to be cleared again through an extra movzbl.

"shr %al" is encoded as "d0 e8" while "shr %eax" is "d1 e8"
The former is not more efficient than the latter.

As Marc Glisse pointed out, a workaround is convincing gcc to store the
intermediate result in a full-width register:

unsigned long long foo3(unsigned char *p)
{
    unsigned int temp = *p;
    return temp / 2;
}

foo3:
    movzbl  (%rdi), %eax
    shrl    %eax
    ret

gcc "knows" that shrl does not "disturb" RAX, so no extra movzbl required.


According to the "integer promotions" rules, *p is promoted to int in the
expression (*p / 2) used in foo2. However...

unsigned long long foo4(unsigned char *p)
{
    int temp = *p;
    return temp / 2;
}

foo4:
    movzbl  (%rdi), %eax
    sarl    %eax
    cltq
    ret


An 8-bit unsigned char promoted to int or to unsigned int yields the same
value, so I would expect identical code for foo2, foo3, and foo4.
Yet we get three different code sequences.
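The promotion claim is easy to verify exhaustively (a hypothetical test
harness, not from the report):

#include <assert.h>
#include <limits.h>

int main(void)
{
    /* For every possible unsigned char value, promotion to int and to
       unsigned int must yield the same quotient. */
    for (unsigned v = 0; v <= UCHAR_MAX; v++) {
        unsigned char x = (unsigned char)v;
        assert((int)x / 2 == (int)((unsigned)x / 2));
    }
    return 0;
}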

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-01 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #18 from Mason  ---
Hello Michael_S,

As far as I can see, massaging the source helps GCC generate optimal code
(in terms of instruction count; I'm not convinced about scheduling).

#include <immintrin.h>
typedef unsigned long long u64;
void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst+0);
  c = _addcarry_u64(c, A[1], B[1], dst+1);
  c = _addcarry_u64(c, A[2], B[2], dst+2);
  c = _addcarry_u64(c, A[3], B[3], dst+3);
}


On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:

add4i:
    movq    (%rdx), %rax
    addq    (%rsi), %rax
    movq    %rax, (%rdi)
    movq    8(%rsi), %rax
    adcq    8(%rdx), %rax
    movq    %rax, 8(%rdi)
    movq    16(%rsi), %rax
    adcq    16(%rdx), %rax
    movq    %rax, 16(%rdi)
    movq    24(%rdx), %rax
    adcq    24(%rsi), %rax
    movq    %rax, 24(%rdi)
    ret

I'll run a few benchmarks to test optimal scheduling.

[Bug target/110104] New: gcc produces sub-optimal code for _addcarry_u64 chain

2023-06-03 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104

Bug ID: 110104
   Summary: gcc produces sub-optimal code for _addcarry_u64 chain
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: slash.tmp at free dot fr
  Target Milestone: ---

Consider the following code:

#include <immintrin.h>
typedef unsigned long long u64;
typedef unsigned __int128 u128;
void testcase1(u64 *acc, u64 a, u64 b)
{
  u128 res = (u128)a*b;
  u64 lo = res, hi = res >> 64;
  unsigned char cf = 0;
  cf = _addcarry_u64(cf, lo, acc[0], acc+0);
  cf = _addcarry_u64(cf, hi, acc[1], acc+1);
  cf = _addcarry_u64(cf,  0, acc[2], acc+2);
}
void testcase2(u64 *acc, u64 a, u64 b)
{
  u128 res = (u128)a * b;
  u64 lo = res, hi = res >> 64;
  asm("add %[LO], %[D0]\n\t" "adc %[HI], %[D1]\n\t" "adc $0, %[D2]" :
  [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2]) :
  [LO] "r" (lo), [HI] "r" (hi) : "cc");
}

Compilation with either
  gcc-trunk -Wall -Wextra -O3 -S testcase.c
or
  gcc-trunk -Wall -Wextra -Os -S testcase.c
generates the same code:

// rdi = acc, rsi = a, rdx = b

testcase1:
  movq %rsi, %rax
  mulq %rdx
  addq %rax, (%rdi)
  movq %rdx, %rax
  adcq 8(%rdi), %rax
  adcq $0, 16(%rdi)
  movq %rax, 8(%rdi)
  ret

testcase2:
  movq %rsi, %rax   ; rax = rsi = a
  mulq %rdx ; rdx:rax = rax*rdx = a*b
  add %rax, (%rdi)  ; acc[0] += lo
  adc %rdx, 8(%rdi) ; acc[1] += hi + cf
  adc $0, 16(%rdi)  ; acc[2] += cf
  ret


Conclusion:
gcc generates the expected code for testcase2.
However, the code for testcase1 is sub-optimal.

  movq %rdx, %rax
  adcq 8(%rdi), %rax
  movq %rax, 8(%rdi)

instead of

  adc %rdx, 8(%rdi) ; acc[1] += hi + cf


The copy of rdx to rax is useless: the separate load/add and store can be
merged into a single memory-destination adcq, as in testcase2.

[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos

2023-06-03 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974

--- Comment #11 from Mason  ---
Here's umul_least_64() rewritten as mul_64x64_128() in C:

typedef unsigned int u32;
typedef unsigned long long u64;

/* u32 acc[3], a[1], b[1] */
static void mul_add_32x32(u32 *acc, const u32 *a, const u32 *b)
{
  u64 res = (u64)a[0] * b[0];
  u32 lo = res, hi = res >> 32;
  asm("add %[LO], %[D0]\n\t" "adc %[HI], %[D1]\n\t" "adc $0, %[D2]" :
  [D0] "+m" (acc[0]), [D1] "+m" (acc[1]), [D2] "+m" (acc[2]) :
  [LO] "r" (lo), [HI] "r" (hi) : "cc");
}

/* u32 acc[5], a[2], b[2] */
void mul_64x64_128(u32 *acc, const u32 *a, const u32 *b)
{
  mul_add_32x32(acc+0, a+0, b+0);
  mul_add_32x32(acc+1, a+0, b+1);
  mul_add_32x32(acc+1, a+1, b+0);
  mul_add_32x32(acc+2, a+1, b+1);
}

gcc-trunk -O3 -m32

mul_64x64_128:
  pushl %esi
  pushl %ebx
  movl 16(%esp), %ebx   ; ebx = a
  movl 20(%esp), %esi   ; esi = b
  movl 12(%esp), %ecx   ; ecx = acc
  movl (%esi), %eax ; b0
  mull (%ebx)   ; a0*b0
  add %eax, (%ecx)
  adc %edx, 4(%ecx)
  adc $0, 8(%ecx)
  movl 4(%esi), %eax; b1
  mull (%ebx)   ; a0*b1
  add %eax, 4(%ecx)
  adc %edx, 8(%ecx)
  adc $0, 12(%ecx)
  movl (%esi), %eax ; b0
  mull 4(%ebx)  ; a1*b0
  add %eax, 4(%ecx)
  adc %edx, 8(%ecx)
  adc $0, 12(%ecx)
  movl 4(%esi), %eax; b1
  mull 4(%ebx)  ; a1*b1
  add %eax, 8(%ecx)
  adc %edx, 12(%ecx)
  adc $0, 16(%ecx)
  popl %ebx
  popl %esi
  ret

[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos

2023-06-05 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974

--- Comment #12 from Mason  ---
Actually, in this case, we don't need to propagate the carry over 3 limbs.

typedef unsigned int u32;
typedef unsigned long long u64;

/* u32 acc[2], a[1], b[1] */
static void mul_add_32x32(u32 *acc, const u32 *a, const u32 *b)
{
  u64 res = (u64)a[0] * b[0];
  u32 lo = res, hi = res >> 32;
  asm("add %[LO], %[D0]\n\t" "adc %[HI], %[D1]" :
  [D0] "+m" (acc[0]), [D1] "+m" (acc[1]) :
  [LO] "r" (lo), [HI] "r" (hi) : "cc");
}

/* u32 acc[4], a[2], b[2] */
void mul_64x64_128(u32 *acc, const u32 *a, const u32 *b)
{
  mul_add_32x32(acc+0, a+0, b+0);
  mul_add_32x32(acc+1, a+0, b+1);
  mul_add_32x32(acc+1, a+1, b+0);
  mul_add_32x32(acc+2, a+1, b+1);
}

gcc-trunk -O3 -m32

mul_64x64_128:
  pushl %esi
  pushl %ebx
  movl 16(%esp), %ebx
  movl 20(%esp), %esi
  movl 12(%esp), %ecx
  movl (%esi), %eax
  mull (%ebx)
  add %eax, (%ecx)
  adc %edx, 4(%ecx)
  movl 4(%esi), %eax
  mull (%ebx)
  add %eax, 4(%ecx)
  adc %edx, 8(%ecx)
  movl (%esi), %eax
  mull 4(%ebx)
  add %eax, 4(%ecx)
  adc %edx, 8(%ecx)
  movl 4(%esi), %eax
  mull 4(%ebx)
  add %eax, 8(%ecx)
  adc %edx, 12(%ecx)
  popl %ebx
  popl %esi
  ret

[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos

2023-06-06 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974

--- Comment #16 from Mason  ---
For the record, the example I provided was intended to show that, with some
help, GCC can generate good code for bigint multiplication. In this situation,
"help" means a short asm template.

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-13 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #20 from Mason  ---
Doh! You're right.
I come from a background where overlapping/aliasing inputs are heresy,
so I got blindsided :(

This would be the optimal code, right?

add4i:
    # rdi = dst, rsi = a, rdx = b
    movq     0(%rdx), %r8
    movq     8(%rdx), %rax
    movq    16(%rdx), %rcx
    movq    24(%rdx), %rdx
    addq     0(%rsi), %r8
    adcq     8(%rsi), %rax
    adcq    16(%rsi), %rcx
    adcq    24(%rsi), %rdx
    movq    %r8,   0(%rdi)
    movq    %rax,  8(%rdi)
    movq    %rcx, 16(%rdi)
    movq    %rdx, 24(%rdi)
    ret


FWIW, trunk generates even nastier code for znver2:

add4i:
    movq    (%rdx), %rax
    addq    (%rsi), %rax
    movq    8(%rsi), %rcx
    adcq    8(%rdx), %rcx
    movq    16(%rsi), %r8
    adcq    16(%rdx), %r8
    movq    24(%rdx), %rdx
    adcq    24(%rsi), %rdx
    vmovq   %rax, %xmm2
    vpinsrq $1, %rcx, %xmm2, %xmm0
    vmovq   %r8, %xmm1
    vpinsrq $1, %rdx, %xmm1, %xmm1
    vinserti128 $0x1, %xmm1, %ymm0, %ymm0
    vmovdqu %ymm0, (%rdi)
    vzeroupper
    ret

[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain

2023-06-14 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104

--- Comment #2 from Mason  ---
You meant PR79173 ;)

Latest update:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621554.html

I didn't see my testcase specifically in Jakub's patch,
but I'll test trunk on godbolt when/if the patch lands.

[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain

2023-06-16 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104

--- Comment #4 from Mason  ---
I confirm that trunk now emits the same code for testcase1 and testcase2.
Thanks Jakub and Roger, great work!

[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain

2023-07-07 Thread slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104

--- Comment #5 from Mason  ---
FWIW, trunk (gcc14) translates testcase3 to the same code as the other
testcases, while remaining portable across all architectures:

$ gcc-trunk -O3 -march=bdver3 testcase3.c

typedef unsigned long long u64;
typedef unsigned __int128 u128;
void testcase3(u64 *acc, u64 a, u64 b)
{
  int c1, c2;
  u128 res = (u128)a * b;
  u64 lo = res, hi = res >> 64;
  c1 = __builtin_add_overflow(lo, acc[0], &acc[0]);
  c2 = __builtin_add_overflow(hi, acc[1], &acc[1])
 | __builtin_add_overflow(c1, acc[1], &acc[1]);
   __builtin_add_overflow(c2, acc[2], &acc[2]);
}

testcase3:
    movq    %rsi, %rax
    mulq    %rdx
    addq    %rax, (%rdi)
    adcq    %rdx, 8(%rdi)
    adcq    $0, 16(%rdi)
    ret

Thanks again, Jakub.