
already5chosen at yahoo dot com via Gcc-bugs Wed, 07 Jun 2023 16:16:55 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617


--- Comment #19 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #18)
> Hello Michael_S,
> 
> As far as I can see, massaging the source helps GCC generate optimal code
> (in terms of instruction count, not convinced about scheduling).
> 
> #include <x86intrin.h>
> typedef unsigned long long u64;
> void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
> {
>   unsigned char c = 0;
>   c = _addcarry_u64(c, A[0], B[0], dst+0);
>   c = _addcarry_u64(c, A[1], B[1], dst+1);
>   c = _addcarry_u64(c, A[2], B[2], dst+2);
>   c = _addcarry_u64(c, A[3], B[3], dst+3);
> }
> 
> 
> On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
> the expected:
> 
> add4i:
>         movq    (%rdx), %rax
>         addq    (%rsi), %rax
>         movq    %rax, (%rdi)
>         movq    8(%rsi), %rax
>         adcq    8(%rdx), %rax
>         movq    %rax, 8(%rdi)
>         movq    16(%rsi), %rax
>         adcq    16(%rdx), %rax
>         movq    %rax, 16(%rdi)
>         movq    24(%rdx), %rax
>         adcq    24(%rsi), %rax
>         movq    %rax, 24(%rdi)
>         ret
> 
> I'll run a few benchmarks to test optimal scheduling.

That's not merely "massaging the source". That's changing semantics.
Think about what happens when dst points to the middle of A or of B.
The change of semantics effectively prevented the vectorizer from doing harm.
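
To illustrate the aliasing point, here is a minimal sketch. It is not the code from the bug report. It assumes, for comparison, a variant that keeps all four sums in temporaries and stores them at the end. The helper adc() is a hypothetical portable stand-in for _addcarry_u64, and the names add4_store_early, add4_store_late and overlap_demo are made up for this example.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long long u64;

/* Hypothetical portable stand-in for _addcarry_u64: *out = a + b + c,
   returns the carry out. */
static unsigned char adc(unsigned char c, u64 a, u64 b, u64 *out)
{
    u64 s = a + b;
    unsigned char c1 = s < a;   /* carry from a + b */
    u64 r = s + c;
    unsigned char c2 = r < s;   /* carry from adding the incoming carry */
    *out = r;
    return (unsigned char)(c1 | c2);
}

/* Stores each limb as soon as it is computed (the shape in comment #18). */
static void add4_store_early(u64 dst[4], const u64 A[4], const u64 B[4])
{
    unsigned char c = 0;
    for (int i = 0; i < 4; i++)
        c = adc(c, A[i], B[i], dst + i);
}

/* Assumed comparison variant: all loads happen before any store. */
static void add4_store_late(u64 dst[4], const u64 A[4], const u64 B[4])
{
    u64 t[4];
    unsigned char c = 0;
    for (int i = 0; i < 4; i++)
        c = adc(c, A[i], B[i], &t[i]);
    memcpy(dst, t, sizeof t);
}

/* Runs both variants with dst = buf + 1 and A = buf, so dst points
   into the middle of A. */
static void overlap_demo(u64 early[5], u64 late[5])
{
    const u64 B[4] = {10, 10, 10, 10};
    const u64 init[5] = {1, 2, 3, 4, 0};
    memcpy(early, init, sizeof init);
    memcpy(late, init, sizeof init);
    add4_store_early(early + 1, early, B);
    add4_store_late(late + 1, late, B);
    /* With overlapping dst and A the two variants disagree. */
    assert(memcmp(early, late, sizeof init) != 0);
}
```

In the store-early variant the write to dst[0] overwrites A[1] before it is read, so every later limb is computed from an already updated input. The store-late variant reads the original A throughout. When dst does not overlap A or B, both give the same result.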

And yes, for the common non-aliasing case the scheduling is problematic, too.
It would probably not cause a slowdown on the latest and greatest cores, but
could be slow on lesser cores, including your default Zen1.

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
