https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116560
--- Comment #4 from Jeffrey A. Law <law at gcc dot gnu.org> ---
So a few more notes.
ISTM part of the problem is that the bswap pass ignores the fact that the target
may not have efficient support for unaligned accesses. So it creates a 16-bit
load from the two 8-bit elements, and the target then has to break that down
into a safe sequence. Then the bswap idiom gets expanded and we're left with a
redundant zero extension.
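A reduced testcase along these lines shows the shape of the problem (my own
sketch inferred from the assembly below, not necessarily the PR's actual
reproducer):

/* Sketch of a reducer: the two byte loads form a big-endian 16-bit
   value, which the bswap pass turns into a single (potentially
   unaligned) 16-bit load plus a byte-swap idiom.  */
int
foo (unsigned char *p, unsigned int *q)
{
  unsigned short v = (p[0] << 8) | p[1];
  *q = v;
  return v == 0;
}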
If you use -mtune=thead-c906 to select a design that has fast unaligned access,
you'll see improved, but not great code -- the redundant extension is still in
there:
lhu a0,0(a0) # 7 [c=28 l=4] *zero_extendhisi2/1
slli a4,a0,8 # 11 [c=4 l=4] *ashlsi3
srli a5,a0,8 # 10 [c=4 l=4] *lshrsi3
add a5,a5,a4 # 14 [c=4 l=4] *addsi3/0
slli a5,a5,16 # 31 [c=4 l=4] *ashlsi3
srli a5,a5,16 # 32 [c=4 l=4] *lshrsi3
seqz a0,a0 # 24 [c=4 l=4] *seq_zero_sisi
sw a5,0(a1) # 16 [c=4 l=4] *movsi_internal/3
ret # 35 [c=0 l=4] simple_return
We can get the same thing with the default tuning by hacking up the relevant
part of the bswap pass to query whether the target has fast unaligned access.
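Roughly the kind of guard I mean, in the load-merging path of
gimple-ssa-store-merging.cc -- a sketch only, not the actual patch, with src
and load_type standing in for whatever the surrounding code calls the source
reference and the merged load's type:

  /* Sketch: don't merge the byte accesses into one wider load if that
     load would be unaligned and the target says unaligned accesses are
     slow.  */
  unsigned int align = get_object_alignment (src);
  machine_mode mode = TYPE_MODE (load_type);
  if (align < GET_MODE_ALIGNMENT (mode)
      && targetm.slow_unaligned_access (mode, align))
    return false;  /* Or however we punt at that point.  */

Note the generic hook is phrased as "slow" rather than "fast", so the sense of
the test is inverted relative to the description above.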
The redundant zero extension is more problematic. For reasons I still don't
understand, our nonzero-bits tracking totally mucks up the set of potentially
nonzero bits. If we put a breakpoint in combine_simplify_rtx we eventually see:
(ior:SI (and:SI (reg:SI 147)
        (const_int 65280 [0xff00]))
    (and:SI (subreg:SI (reg:HI 148 [ _5 ]) 0)
        (const_int 255 [0xff])))
And if we query the nonzero bits of (reg 147) and (reg 148) we get:
(gdb) p/x nonzero_bits (x.u.fld[0].rt_rtx.u.fld[0].rt_rtx, E_SImode)
$198 = 0xff00ff00
(gdb) p/x nonzero_bits (x.u.fld[1].rt_rtx.u.fld[0].rt_rtx, E_SImode)
$203 = 0xffff00ff
The second result kind of makes sense in that we've got a paradoxical subreg,
so the bits outside HImode are undefined: the upper 16 bits all show as
potentially nonzero (0xffff0000) while the defined HImode bits contribute only
0x00ff, which is exactly the 0xffff00ff we get.