https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049
--- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> Perhaps the r12-2288-g8695bf78dad1a42636 change wasn't a good idea?
I think it's still a good idea, as it fixes a bigger problem (unneeded SIMD
partial extracts) and makes RTL easier to write since you no longer have to
deal with both VEC_SELECT and subregs. Having one canonical form is better.
> I mean, if we add some hack for the .REDUC_* stuff so that we don't have the
> lowpart vec_select that r12-2288 folds into a subreg, won't we still suffer
> the same problem when doing anything similar?
Yes, but I think the problem is in how we do the transfers to start with. While
looking at this issue I noticed that the SIMD <-> genreg transfers for sizes
that have no exact general-register counterpart (i.e. 8-bit and 16-bit) are
already suboptimal (even before this change) in a number of cases, and dealing
with that underlying problem first is better, so I postponed it to GCC 13.
That is to say, even
typedef int V __attribute__((vector_size (4 * sizeof (int))));

int
test (V a)
{
  int sum = a[0];
  return (unsigned int)sum >> 16;
}
is suboptimal.
> E.g. with -O2:
>
> typedef int V __attribute__((vector_size (4 * sizeof (int))));
>
> int
> test (V a)
> {
> int sum = a[0];
> return (((unsigned short)sum) + ((unsigned int)sum >> 16)) >> 1;
> }
>
> The assembly difference is then:
> - fmov w0, s0
> - lsr w1, w0, 16
> - add w0, w1, w0, uxth
> + umov w0, v0.h[0]
> + fmov w1, s0
> + add w0, w0, w1, lsr 16
> lsr w0, w0, 1
> ret
> Dunno how costly on aarch64 is Neon -> GPR register move.
> Is fmov w0, s0; fmov w1, s0 or fmov w0, s0; mov w1, w0 cheaper?
The answer is quite uarch specific, but in general fmov w0, s0; mov w1, w0 is
cheaper. That said, for the sequence you pasted above it's really a bit of a
wash.
The old codegen has a longer dependency chain and needed both a shift and a
zero-extend; the new codegen removes the zero-extend and folds the shift into
the add, but adds a transfer, so it about cancels out.
Ideally we'd want here:
umov w0, v0.h[0]
umov w1, v0.h[1]
add w0, w0, w1
lsr w0, w0, 1
ret
where the shift and the zero-extend are gone and the two moves can be done in
parallel.