[Bug c/104439] New: arm crc feature not enabled in assembly for function with crc target attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104439

            Bug ID: 104439
           Summary: arm crc feature not enabled in assembly for function
                    with crc target attribute
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

Minimized reproducer:

void some_other_function(void)
{
}

unsigned int __attribute__((target("arch=armv8-a+crc")))
crc32_arm(unsigned int crc, const unsigned char *data, unsigned long size)
{
	for (unsigned long i = 0; i < size; i++)
		crc = __builtin_arm_crc32b(crc, data[i]);
	return crc;
}

#pragma GCC push_options
#pragma GCC pop_options

$ arm-linux-gnueabihf-gcc -c test.c
/tmp/ccCUwib8.s: Assembler messages:
/tmp/ccCUwib8.s:58: Error: selected processor does not support `crc32b r3,r3,r2' in ARM mode

$ arm-linux-gnueabihf-gcc --version
arm-linux-gnueabihf-gcc (GCC) 11.2.0

The crc32_arm() function is compiled with crc instructions enabled, and gcc is
emitting assembly code that uses them.  However, gcc isn't emitting the
directives that tell the assembler to allow these instructions.

The problem goes away if either some_other_function() is deleted, or if the
"GCC push_options" and "GCC pop_options" pair is deleted.  I don't see any
logical reason why either change would make a difference.

(The original, non-minimized code this problem was seen on can be found at
https://github.com/ebiggers/libdeflate/blob/v1.9/lib/arm/crc32_impl.h#L75.)
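
As a side note, here is a minimal sketch (not part of the report) of the same
loop written with the ACLE intrinsic __crc32b from <arm_acle.h> instead of the
raw __builtin_arm_crc32b.  Whether arm_acle.h exposes __crc32b under a
per-function target attribute, as opposed to a command-line -march=armv8-a+crc,
depends on the GCC version, so this is only an illustration of the equivalent
documented spelling and would be expected to hit the same missing-directive
problem:

#include <arm_acle.h>

/* Hypothetical name; same operation as the __builtin_arm_crc32b loop above. */
unsigned int __attribute__((target("arch=armv8-a+crc")))
crc32_arm_acle(unsigned int crc, const unsigned char *data, unsigned long size)
{
	for (unsigned long i = 0; i < size; i++)
		crc = __crc32b(crc, data[i]);
	return crc;
}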
[Bug target/104439] arm crc feature not enabled in assembly for function with crc target attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104439

--- Comment #3 from Eric Biggers ---

I ran a bisection and found that the following commit fixed this bug:

commit c1cdabe3aab817d95a8db00a8b5e9f6bcdea936f
Author: Richard Earnshaw
Date:   Thu Jul 29 11:00:31 2021 +0100

    arm: reorder assembler architecture directives [PR101723]

This commit is on the branches origin/releases/gcc-9, origin/releases/gcc-10,
and origin/releases/gcc-11.  That means it will be in gcc 9.5, 10.4, and 11.3,
right?  It looks like gcc-8 is no longer maintained.

So it looks like there's nothing else to do here and this can be closed.
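
For downstream projects, here is a minimal preprocessor sketch (not taken from
the report; the macro name is made up for illustration) of one way to gate use
of the crc target attribute on GCC releases expected to contain the fix,
assuming it ships in 9.5, 10.4, and 11.3 as discussed above:

/* Assumes the fix is present in GCC >= 9.5, >= 10.4, >= 11.3, and >= 12. */
#if defined(__GNUC__) && !defined(__clang__) && \
	(__GNUC__ >= 12 || \
	 (__GNUC__ == 11 && __GNUC_MINOR__ >= 3) || \
	 (__GNUC__ == 10 && __GNUC_MINOR__ >= 4) || \
	 (__GNUC__ == 9 && __GNUC_MINOR__ >= 5))
#  define HAVE_WORKING_ARM_CRC_TARGET_ATTR 1
#else
#  define HAVE_WORKING_ARM_CRC_TARGET_ATTR 0
#endif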
[Bug rtl-optimization/107892] New: Unnecessary move between ymm registers in loop using AVX2 intrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

            Bug ID: 107892
           Summary: Unnecessary move between ymm registers in loop using
                    AVX2 intrinsic
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

To reproduce with the latest trunk, compile the following .c file on x86_64 at
-O2:

#include <immintrin.h>

int __attribute__((target("avx2")))
sum_ints(const __m256i *p, size_t n)
{
	__m256i a = _mm256_setzero_si256();
	__m128i b;

	do {
		a = _mm256_add_epi32(a, *p++);
	} while (--n);

	b = _mm_add_epi32(_mm256_extracti128_si256(a, 0),
			  _mm256_extracti128_si256(a, 1));
	b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
	b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
	return _mm_cvtsi128_si32(b);
}

The assembly that gcc generates is:

<sum_ints>:
   0:	c5 f1 ef c9          	vpxor  %xmm1,%xmm1,%xmm1
   4:	0f 1f 40 00          	nopl   0x0(%rax)
   8:	c5 f5 fe 07          	vpaddd (%rdi),%ymm1,%ymm0
   c:	48 83 c7 20          	add    $0x20,%rdi
  10:	c5 fd 6f c8          	vmovdqa %ymm0,%ymm1
  14:	48 83 ee 01          	sub    $0x1,%rsi
  18:	75 ee                	jne    8
  1a:	c4 e3 7d 39 c1 01    	vextracti128 $0x1,%ymm0,%xmm1
  20:	c5 f9 fe c1          	vpaddd %xmm1,%xmm0,%xmm0
  24:	c5 f9 70 c8 31       	vpshufd $0x31,%xmm0,%xmm1
  29:	c5 f1 fe c8          	vpaddd %xmm0,%xmm1,%xmm1
  2d:	c5 f9 70 c1 02       	vpshufd $0x2,%xmm1,%xmm0
  32:	c5 f9 fe c1          	vpaddd %xmm1,%xmm0,%xmm0
  36:	c5 f9 7e c0          	vmovd  %xmm0,%eax
  3a:	c5 f8 77             	vzeroupper
  3d:	c3                   	ret

The bug is that the inner loop contains an unnecessary vmovdqa:

   8:	vpaddd  (%rdi),%ymm1,%ymm0
	add     $0x20,%rdi
	vmovdqa %ymm0,%ymm1
	sub     $0x1,%rsi
	jne     8

It should look like the following instead:

   8:	vpaddd  (%rdi),%ymm0,%ymm0
	add     $0x20,%rdi
	sub     $0x1,%rsi
	jne     8

Strangely, the bug goes away if the __v8si type is used instead of __m256i and
the addition is done using "+=" instead of _mm256_add_epi32():

int __attribute__((target("avx2")))
sum_ints_good(const __v8si *p, size_t n)
{
	__v8si a = {};
	__m128i b;

	do {
		a += *p++;
	} while (--n);

	b = _mm_add_epi32(_mm256_extracti128_si256((__m256i)a, 0),
			  _mm256_extracti128_si256((__m256i)a, 1));
	b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
	b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
	return _mm_cvtsi128_si32(b);
}

In the bad version, I noticed that the RTL initially has two separate insns
for 'a += *p': one to do the addition and write the result to a new pseudo
register, and one to convert the value from mode V8SI to V4DI and assign it to
the original pseudo register.  These two separate insns never get combined.
(That sort of explains why the bug isn't seen with the __v8si and += method;
gcc doesn't do a type conversion with that method.)

So, I'm wondering if the bug is in the instruction combining pass.  Or perhaps
the RTL should never have had two separate insns in the first place?
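
Building on the observation above that the "+=" on __v8si spelling avoids the
extra move, here is a minimal sketch (not from the report, and not verified to
dodge the vmovdqa in any particular GCC version) of a more localized rewrite
that keeps the __m256i interface but expresses the addition through the __v8si
view of the operands.  The function name is made up for illustration:

#include <immintrin.h>
#include <stddef.h>

int __attribute__((target("avx2")))
sum_ints_alt(const __m256i *p, size_t n)
{
	__m256i a = _mm256_setzero_si256();
	__m128i b;

	do {
		/* Same addition as _mm256_add_epi32(), spelled with the GCC
		   vector extension on __v8si, mirroring sum_ints_good(). */
		a = (__m256i)((__v8si)a + (__v8si)*p++);
	} while (--n);

	b = _mm_add_epi32(_mm256_extracti128_si256(a, 0),
			  _mm256_extracti128_si256(a, 1));
	b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
	b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
	return _mm_cvtsi128_si32(b);
}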
[Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #1 from Eric Biggers ---

The reproducer I gave in my first comment doesn't reproduce the bug on
releases/gcc-11.1.0, so it must have regressed between then and trunk.  I can
do a bisection if needed.

However, I actually still see the bug with gcc-11.1.0 on my original
unminimized code at
https://github.com/ebiggers/libdeflate/blob/fb0c43373f6fe600471457f4c021b8ad7e4bbabf/lib/x86/adler32_impl.h#L142.
So maybe the reproducer I gave is not the best one.

Here is a slightly different reproducer that reproduces the bug with both
gcc-11.1.0 and trunk:

#include <immintrin.h>

__m256i __attribute__((target("avx2")))
f(const __m256i *p, size_t n)
{
	__m256i a = _mm256_setzero_si256();

	do {
		a = _mm256_add_epi32(a, *p++);
	} while (--n);
	return _mm256_madd_epi16(a, a);
}

The assembly of the loop has the unnecessary vmovdqa:

   8:	c5 f5 fe 07          	vpaddd (%rdi),%ymm1,%ymm0
   c:	48 83 c7 20          	add    $0x20,%rdi
  10:	c5 fd 6f c8          	vmovdqa %ymm0,%ymm1
  14:	48 83 ee 01          	sub    $0x1,%rsi
  18:	75 ee                	jne    8
[Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #2 from Eric Biggers ---

This is also reproducible with SSE2.
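
For reference, here is a plausible SSE2 analogue of the reproducer from
comment #1 (a sketch constructed here for illustration, not taken from the
report; the function name is made up), showing the same accumulate-in-a-loop
pattern with 128-bit vectors:

#include <immintrin.h>
#include <stddef.h>

/* SSE2 is part of the x86_64 baseline, so no target attribute is needed. */
__m128i f_sse2(const __m128i *p, size_t n)
{
	__m128i a = _mm_setzero_si128();

	do {
		a = _mm_add_epi32(a, *p++);
	} while (--n);
	return _mm_madd_epi16(a, a);
}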