[Bug c/104439] New: arm crc feature not enabled in assembly for function with crc target attribute

2022-02-07 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104439

Bug ID: 104439
   Summary: arm crc feature not enabled in assembly for function
with crc target attribute
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

Minimized reproducer:

void some_other_function(void) { }

unsigned int __attribute__((target("arch=armv8-a+crc")))
crc32_arm(unsigned int crc, const unsigned char *data, unsigned long size)
{
        for (unsigned long i = 0; i < size; i++)
                crc = __builtin_arm_crc32b(crc, data[i]);
        return crc;
}

#pragma GCC push_options
#pragma GCC pop_options

$ arm-linux-gnueabihf-gcc -c test.c
/tmp/ccCUwib8.s: Assembler messages:
/tmp/ccCUwib8.s:58: Error: selected processor does not support `crc32b r3,r3,r2' in ARM mode
$ arm-linux-gnueabihf-gcc --version
arm-linux-gnueabihf-gcc (GCC) 11.2.0

The crc32_arm() function is compiled with crc instructions enabled, and gcc is
emitting assembly code using them.  However, gcc isn't emitting the directives
that tell the assembler to allow these instructions.
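
(For reference, the missing directive is an architecture directive that lets
gas accept the crc32 instructions; a sketch of what the generated .s file
would need, noting that the exact form gcc emits may vary by version:)

        .arch armv8-a+crc
        @ or, equivalently:
        .arch_extension crc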

The problem goes away if either some_other_function() is deleted, or if the
"GCC push_options" and "GCC pop_options" pair is deleted.  I don't see any
logical reason why either change would make a difference.

(The original, non-minimized code this problem was seen on can be found at
https://github.com/ebiggers/libdeflate/blob/v1.9/lib/arm/crc32_impl.h#L75.)

[Bug target/104439] arm crc feature not enabled in assembly for function with crc target attribute

2022-02-07 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104439

--- Comment #3 from Eric Biggers ---
I ran a bisection and found that the following commit fixed this bug:

  commit c1cdabe3aab817d95a8db00a8b5e9f6bcdea936f
  Author: Richard Earnshaw 
  Date:   Thu Jul 29 11:00:31 2021 +0100

  arm: reorder assembler architecture directives [PR101723]
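
(The bisection follows the standard git bisect workflow; roughly the
following sketch, with illustrative endpoints:)

  $ git bisect start --term-old=broken --term-new=fixed
  $ git bisect broken releases/gcc-11.2.0   # reproducer fails to build here
  $ git bisect fixed master                 # reproducer builds here
  # then, at each step: build gcc, run the reproducer, and mark the
  # revision with "git bisect fixed" or "git bisect broken"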

This commit is on the branches origin/releases/gcc-9, origin/releases/gcc-10,
and origin/releases/gcc-11.  That means it will be in gcc 9.5, 10.4, and 11.3,
right?  It looks like gcc-8 is no longer maintained.

So, it looks like there's nothing else to do here and this can be closed.

[Bug rtl-optimization/107892] New: Unnecessary move between ymm registers in loop using AVX2 intrinsic

2022-11-27 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

Bug ID: 107892
   Summary: Unnecessary move between ymm registers in loop using
AVX2 intrinsic
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

To reproduce with the latest trunk, compile the following .c file on x86_64 at
-O2:

#include <immintrin.h>

int __attribute__((target("avx2")))
sum_ints(const __m256i *p, size_t n)
{
        __m256i a = _mm256_setzero_si256();
        __m128i b;

        do {
                a = _mm256_add_epi32(a, *p++);
        } while (--n);

        b = _mm_add_epi32(_mm256_extracti128_si256(a, 0),
                          _mm256_extracti128_si256(a, 1));
        b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
        b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
        return _mm_cvtsi128_si32(b);
}

The assembly that gcc generates is:

0000000000000000 <sum_ints>:
   0:   c5 f1 ef c9             vpxor  %xmm1,%xmm1,%xmm1
   4:   0f 1f 40 00             nopl   0x0(%rax)
   8:   c5 f5 fe 07             vpaddd (%rdi),%ymm1,%ymm0
   c:   48 83 c7 20             add    $0x20,%rdi
  10:   c5 fd 6f c8             vmovdqa %ymm0,%ymm1
  14:   48 83 ee 01             sub    $0x1,%rsi
  18:   75 ee                   jne    8 <sum_ints+0x8>
  1a:   c4 e3 7d 39 c1 01       vextracti128 $0x1,%ymm0,%xmm1
  20:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
  24:   c5 f9 70 c8 31          vpshufd $0x31,%xmm0,%xmm1
  29:   c5 f1 fe c8             vpaddd %xmm0,%xmm1,%xmm1
  2d:   c5 f9 70 c1 02          vpshufd $0x2,%xmm1,%xmm0
  32:   c5 f9 fe c1             vpaddd %xmm1,%xmm0,%xmm0
  36:   c5 f9 7e c0             vmovd  %xmm0,%eax
  3a:   c5 f8 77                vzeroupper
  3d:   c3                      ret

The bug is that the inner loop contains an unnecessary vmovdqa:

   8:   vpaddd (%rdi),%ymm1,%ymm0
        add    $0x20,%rdi
        vmovdqa %ymm0,%ymm1
        sub    $0x1,%rsi
        jne    8 <sum_ints+0x8>

It should look like the following instead:

   8:   vpaddd (%rdi),%ymm0,%ymm0
        add    $0x20,%rdi
        sub    $0x1,%rsi
        jne    8 <sum_ints+0x8>

Strangely, the bug goes away if the __v8si type is used instead of __m256i and
the addition is done using "+=" instead of _mm256_add_epi32():

int __attribute__((target("avx2")))
sum_ints_good(const __v8si *p, size_t n)
{
        __v8si a = {};
        __m128i b;

        do {
                a += *p++;
        } while (--n);

        b = _mm_add_epi32(_mm256_extracti128_si256((__m256i)a, 0),
                          _mm256_extracti128_si256((__m256i)a, 1));
        b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
        b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
        return _mm_cvtsi128_si32(b);
}

In the bad version, I noticed that the RTL initially has two separate insns for
'a += *p': one to do the addition and write the result to a new pseudo
register, and one to convert the value from mode V8SI to V4DI and assign it to
the original pseudo register.  These two separate insns never get combined. 
(That sort of explains why the bug isn't seen with the __v8si and += method;
gcc doesn't do a type conversion with that method.)  So, I'm wondering if the
bug is in the instruction combining pass.  Or perhaps the RTL should never have
had two separate insns in the first place?
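
(Schematically, the pair of insns looks something like the following; the
pseudo-register numbers are made up, but the shape matches the description
above: the add produces a new V8SI pseudo, then a subreg copy converts it
back to the original V4DI pseudo:)

  (set (reg:V8SI 91) (plus:V8SI (subreg:V8SI (reg:V4DI 85) 0)
                                (mem:V8SI (reg:DI 92))))
  (set (reg:V4DI 85) (subreg:V4DI (reg:V8SI 91) 0))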

[Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic

2022-11-28 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #1 from Eric Biggers ---
The reproducer I gave in my first comment doesn't reproduce the bug on
releases/gcc-11.1.0, so it must have regressed between then and trunk.  I can
do a bisection if needed.

However, I actually still see the bug with gcc-11.1.0 on my original
unminimized code at
https://github.com/ebiggers/libdeflate/blob/fb0c43373f6fe600471457f4c021b8ad7e4bbabf/lib/x86/adler32_impl.h#L142.
So maybe the reproducer I gave is not the best one.  Here is a slightly
different reproducer that triggers the bug with both gcc-11.1.0 and trunk:

#include <immintrin.h>

__m256i __attribute__((target("avx2")))
f(const __m256i *p, size_t n)
{
        __m256i a = _mm256_setzero_si256();

        do {
                a = _mm256_add_epi32(a, *p++);
        } while (--n);

        return _mm256_madd_epi16(a, a);
}

The assembly of the loop has the unnecessary vmovdqa:

   8:   c5 f5 fe 07             vpaddd (%rdi),%ymm1,%ymm0
   c:   48 83 c7 20             add    $0x20,%rdi
  10:   c5 fd 6f c8             vmovdqa %ymm0,%ymm1
  14:   48 83 ee 01             sub    $0x1,%rsi
  18:   75 ee                   jne    8 <f+0x8>

[Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic

2022-11-28 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #2 from Eric Biggers ---
This is also reproducible with SSE2.
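
(Presumably via the direct SSE2 analogue of the reproducer above, along these
lines; this adaptation is illustrative, not taken from the original report:)

#include <stddef.h>
#include <immintrin.h>

__m128i f(const __m128i *p, size_t n)
{
        __m128i a = _mm_setzero_si128();

        do {
                a = _mm_add_epi32(a, *p++);
        } while (--n);

        return _mm_madd_epi16(a, a);
}

Note that no target attribute is needed here, since SSE2 is part of the
x86_64 baseline.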