[Bug c/94994] New: [10 Regression] possible miscompilation of word-at-a-time copy via packed structs

2020-05-08 Thread ebiggers3 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94994

Bug ID: 94994
   Summary: [10 Regression] possible miscompilation of
word-at-a-time copy via packed structs
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

Starting with gcc 10, the following code (based on
https://github.com/ebiggers/libdeflate/blob/v1.5/lib/decompress_template.h#L353)
isn't compiled as expected at -O3 on x86_64:


#include 
#include 

#define WORDSIZEsizeof(size_t)

struct word_unaligned {
size_t x;
} __attribute__((packed, may_alias));

static inline size_t load_word_unaligned(const char *p)
{
return ((struct word_unaligned *)p)->x;
}

static inline void store_word_unaligned(size_t x, char *p)
{
((struct word_unaligned *)p)->x = x;
}

void __attribute__((noinline))
copy(char *dst, const char *src, size_t word_count)
{
do {
store_word_unaligned(load_word_unaligned(src), dst);
src += WORDSIZE;
dst += WORDSIZE;
} while (--word_count);
}

int main()
{
char buf[9 + 6 * WORDSIZE + 1] = "012345678";

copy(&buf[9], &buf[0], 6);

puts(buf);
}


The code is supposed to copy 6 eight-byte words from &buf[0] to &buf[9], one
word at a time, resulting in the output
"012345678012345678012345678012345678012345678012345678012".  But the actual
output is "012345678012345678".  Based on the diassembly, this seems to be
caused by gcc assuming that the src and dst pointers are offset from each other
by a multiple of 8 bytes.  It uses this assumption to generate
16-byte-at-a-time SSE copy code.  But that's invalid when the pointers are
actually offset by 9 bytes as in the example.

Is this working as intended?  I don't think so, since the use of packed and
may_alias should make gcc assume that the pointers can have any alignment, and
each store can alias the next load.  But perhaps there's some reason I'm
missing why the code could nevertheless be considered incorrect, or perhaps
there's some ambiguity in what 'packed' is supposed to do (it's a gcc extension
after all).

I'm going to change my code to use memcpy() anyway, since the bugs where gcc
generated bad code for memcpy() in some cases have supposedly been fixed in
recent gcc's.  But I thought I'd point this out since I'm not sure it's working
as intended, and probably other people will run into it too.  Decompression
code is most likely to be affected by this.

gcc 9.3.0 works fine.  I didn't test anything in between that and 10.1.

[Bug c/104439] New: arm crc feature not enabled in assembly for function with crc target attribute

2022-02-07 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104439

Bug ID: 104439
   Summary: arm crc feature not enabled in assembly for function
with crc target attribute
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

Minimized reproducer:

void some_other_function(void) { }

unsigned int __attribute__((target("arch=armv8-a+crc")))
crc32_arm(unsigned int crc, const unsigned char *data, unsigned long size)
{
for (unsigned long i = 0; i < size; i++)
crc = __builtin_arm_crc32b(crc, data[i]);
return crc;
}

#pragma GCC push_options
#pragma GCC pop_options

$ arm-linux-gnueabihf-gcc -c test.c
/tmp/ccCUwib8.s: Assembler messages:
/tmp/ccCUwib8.s:58: Error: selected processor does not support `crc32b
r3,r3,r2' in ARM mode
$ arm-linux-gnueabihf-gcc --version
arm-linux-gnueabihf-gcc (GCC) 11.2.0

The crc32_arm() function is compiled with crc instructions enabled, and gcc is
emitting assembly code using them.  However, gcc isn't emitting the directives
that tell the assembler to allow these instructions.

The problem goes away if either some_other_function() is deleted, or if the
"GCC push_options" and "GCC pop_options" pair is deleted.  I don't see any
logical reason why either change would make a difference.

(The original, non-minimized code this problem was seen on can be found at
https://github.com/ebiggers/libdeflate/blob/v1.9/lib/arm/crc32_impl.h#L75.)

[Bug target/104439] arm crc feature not enabled in assembly for function with crc target attribute

2022-02-07 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104439

--- Comment #3 from Eric Biggers  ---
I ran a bisection and found that the following commit fixed this bug:

  commit c1cdabe3aab817d95a8db00a8b5e9f6bcdea936f
  Author: Richard Earnshaw 
  Date:   Thu Jul 29 11:00:31 2021 +0100

  arm: reorder assembler architecture directives [PR101723]

This commit is on the branches origin/release/gcc-9, origin/releases/gcc-10,
and origin/releases/gcc-11.  That means it will be in gcc 9.5, 10.4, and 11.3,
right?  It looks like gcc-8 is no longer maintained.

So, it looks like there's nothing else to do here and this can be closed.

[Bug rtl-optimization/107892] New: Unnecessary move between ymm registers in loop using AVX2 intrinsic

2022-11-27 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

Bug ID: 107892
   Summary: Unnecessary move between ymm registers in loop using
AVX2 intrinsic
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ebiggers3 at gmail dot com
  Target Milestone: ---

To reproduce with the latest trunk, compile the following .c file on x86_64 at
-O2:

#include 

int __attribute__((target("avx2")))
sum_ints(const __m256i *p, size_t n)
{
__m256i a = _mm256_setzero_si256();
__m128i b;

do {
a = _mm256_add_epi32(a, *p++);
} while (--n);

b = _mm_add_epi32(_mm256_extracti128_si256(a, 0),
  _mm256_extracti128_si256(a, 1));
b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
return _mm_cvtsi128_si32(b);
}

The assembly that gcc generates is:

 :
   0:   c5 f1 ef c9 vpxor  %xmm1,%xmm1,%xmm1
   4:   0f 1f 40 00 nopl   0x0(%rax)
   8:   c5 f5 fe 07 vpaddd (%rdi),%ymm1,%ymm0
   c:   48 83 c7 20 add$0x20,%rdi
  10:   c5 fd 6f c8 vmovdqa %ymm0,%ymm1
  14:   48 83 ee 01 sub$0x1,%rsi
  18:   75 ee   jne8 
  1a:   c4 e3 7d 39 c1 01   vextracti128 $0x1,%ymm0,%xmm1
  20:   c5 f9 fe c1 vpaddd %xmm1,%xmm0,%xmm0
  24:   c5 f9 70 c8 31  vpshufd $0x31,%xmm0,%xmm1
  29:   c5 f1 fe c8 vpaddd %xmm0,%xmm1,%xmm1
  2d:   c5 f9 70 c1 02  vpshufd $0x2,%xmm1,%xmm0
  32:   c5 f9 fe c1 vpaddd %xmm1,%xmm0,%xmm0
  36:   c5 f9 7e c0 vmovd  %xmm0,%eax
  3a:   c5 f8 77vzeroupper
  3d:   c3  ret

The bug is that the inner loop contains an unnecessary vmovdqa:

   8:   vpaddd (%rdi),%ymm1,%ymm0
add$0x20,%rdi
vmovdqa %ymm0,%ymm1
sub$0x1,%rsi
jne8 

It should look like the following instead:

   8:   vpaddd (%rdi),%ymm0,%ymm0
add$0x20,%rdi
sub$0x1,%rsi
jne8 

Strangely, the bug goes away if the __v8si type is used instead of __m256i and
the addition is done using "+=" instead of _mm256_add_epi32():

int __attribute__((target("avx2")))
sum_ints_good(const __v8si *p, size_t n)
{
__v8si a = {};
__m128i b;

do {
a += *p++;
} while (--n);

b = _mm_add_epi32(_mm256_extracti128_si256((__m256i)a, 0),
  _mm256_extracti128_si256((__m256i)a, 1));
b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x31));
b = _mm_add_epi32(b, _mm_shuffle_epi32(b, 0x02));
return _mm_cvtsi128_si32(b);
}

In the bad version, I noticed that the RTL initially has two separate insns for
'a += *p': one to do the addition and write the result to a new pseudo
register, and one to convert the value from mode V8SI to V4DI and assign it to
the original pseudo register.  These two separate insns never get combined. 
(That sort of explains why the bug isn't seen with the __v8si and += method;
gcc doesn't do a type conversion with that method.)  So, I'm wondering if the
bug is in the instruction combining pass.  Or perhaps the RTL should never have
had two separate insns in the first place?

[Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic

2022-11-28 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #1 from Eric Biggers  ---
The reproducer I gave in my first comment doesn't reproduce the bug on
releases/gcc-11.1.0, so it must have regressed between then and trunk.  I can
do a bisection if needed.

However, I actually still see the bug with gcc-11.1.0 on my original
unminimized code at
https://github.com/ebiggers/libdeflate/blob/fb0c43373f6fe600471457f4c021b8ad7e4bbabf/lib/x86/adler32_impl.h#L142.
 So maybe the reproducer I gave is not the best one.  Here is a slightly
different reproducer that reproduces the bug with both gcc-11.1.0 and trunk:

#include 

__m256i __attribute__((target("avx2")))
f(const __m256i *p, size_t n)
{
__m256i a = _mm256_setzero_si256();

do {
a = _mm256_add_epi32(a, *p++);
} while (--n);

return _mm256_madd_epi16(a, a);
}

The assembly of the loop has the unnecessary vmovdqa:

   8:   c5 f5 fe 07 vpaddd (%rdi),%ymm1,%ymm0
   c:   48 83 c7 20 add$0x20,%rdi
  10:   c5 fd 6f c8 vmovdqa %ymm0,%ymm1
  14:   48 83 ee 01 sub$0x1,%rsi
  18:   75 ee   jne8 

[Bug rtl-optimization/107892] Unnecessary move between ymm registers in loop using AVX2 intrinsic

2022-11-28 Thread ebiggers3 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107892

--- Comment #2 from Eric Biggers  ---
This is also reproducible with SSE2.