The atomic_loaddi_1 and atomic_storedi_1 patterns are fixed to use LM and STM.
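For reference, a minimal C trigger for those patterns (this assumes the
usual __atomic builtins; presumably it matters for the 31-bit case, where
a DImode value does not fit in a single GPR):

#include <stdint.h>

uint64_t v;

/* With this series these should expand via the fixed atomic_loaddi_1
   and atomic_storedi_1 patterns, i.e. LM / STM when the value lives
   in a GPR pair.  */
uint64_t load_v (void) { return __atomic_load_n (&v, __ATOMIC_SEQ_CST); }
void store_v (uint64_t x) { __atomic_store_n (&v, x, __ATOMIC_SEQ_CST); }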
I've had a go at generating better code in the HQImode CAS
loop for aligned memory, but I don't know that I'd call it
the most efficient thing ever (a rough C sketch of the aligned
path follows the list below). Some of the remaining inefficiency
is due to deficiencies in other parts of the compiler (including
the s390 backend):
(1) MEM_ALIGN can't pass down the full align+ofs data that we had
during cfgexpand. This means the opportunities for using
the "aligned" path are fewer than they ought to be.
(2) In get_pointer_alignment (used by get_builtin_sync_mem),
we don't consider an ADDR_EXPR to return the full alignment
that the type is due. I'm sure this is to work around some
other sort of usage via the <string.h> builtins, but it's
less-than-handy in this case.
I wonder if in get_builtin_sync_mem we ought to be using
get_object_alignment (build_fold_indirect_ref (addr)) instead?
Consider
struct S { int x; unsigned short y; } g_s;
unsigned short o, n;

void good (void)
{
  __atomic_compare_exchange_n (&g_s.y, &o, n, 0, 0, 0);
}

void bad (struct S *p_s)
{
  __atomic_compare_exchange_n (&p_s->y, &o, n, 0, 0, 0);
}
where GOOD produces the aligned MEM that we need, and BAD doesn't.
(3) Support for IC and ICM via the insv pattern is lacking.
I've added a tiny bit of support here, in the form of using
the existing strict_low_part patterns, but most definitely we
could do better.
(4) The *sethighpartsi and *sethighpartdi_64 patterns ought to be
more different. As is, we can't insert into bits 48-56 of a
DImode quantity, because we don't generate ICM for DImode,
only ICMH.
(5) Missing support for RISBGZ in the form of an extv/z expander.
The existing *extv/z splitters probably ought to be conditionalized
on !Z10.
(6) The strict_low_part patterns should allow registers for at
least Z10. The SImode strict_low_part can use LR everywhere.
(7) RISBGZ could be used for a 3-address constant lshrsi3 before
srlk is available.
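To make the aligned path concrete, here is roughly the idea the expander
implements, written as C. This is only a sketch, not the actual RTL: the
helper name, the relaxed initial load, and the hard-coded big-endian
layout are mine, and aliasing/padding issues are ignored.

#include <stdbool.h>
#include <stdint.h>

/* 16-bit CAS emulated with a word-sized CS on the containing
   aligned 32-bit word.  */
static bool
cas_aligned_halfword (uint16_t *p, uint16_t *expected, uint16_t desired)
{
  uint32_t *wp = (uint32_t *) ((uintptr_t) p & ~(uintptr_t) 3);
  unsigned int shift = ((uintptr_t) p & 2) ? 0 : 16;  /* big-endian */
  uint32_t mask = (uint32_t) 0xffff << shift;

  /* Bits of the word that do not belong to the halfword.  */
  uint32_t others = __atomic_load_n (wp, __ATOMIC_RELAXED) & ~mask;
  for (;;)
    {
      uint32_t oldw = others | ((uint32_t) *expected << shift);
      uint32_t neww = others | ((uint32_t) desired << shift);
      if (__atomic_compare_exchange_n (wp, &oldw, neww, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
        return true;
      /* CS failed; oldw now holds the current contents of the word.  */
      if ((oldw & mask) != ((uint32_t) *expected << shift))
        {
          /* The halfword itself changed: genuine failure.  */
          *expected = (uint16_t) (oldw >> shift);
          return false;
        }
      /* Only the unrelated bytes changed: refresh them and retry.  */
      others = oldw & ~mask;
    }
}

In the generated code below, RISBG and ICM perform the two inserts and CS
does the word-sized compare-and-swap.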
With this patch set, the GOOD function above compiles at -O3 -march=z10 to:
        larl    %r3,s+4
        lhrl    %r0,o
        lhi     %r2,1
        l       %r1,0(%r3)
        nilh    %r1,0
.L2:
        lr      %r5,%r1
        larl    %r12,n
        lr      %r4,%r1
        risbg   %r4,%r0,32,47,16
        icm     %r5,3,0(%r12)
        cs      %r4,%r5,0(%r3)
        je      .L3
        lr      %r5,%r4
        nilh    %r5,0
        cr      %r5,%r1
        lr      %r1,%r5
        jne     .L2
        lhi     %r2,0
.L3:
        srl     %r4,16
        sthrl   %r4,o
Odd things:
* O is forced into a register before reaching the expander, so we
get the RISBG for that. N is left in memory and so we commit
to using ICM for that. Further, because of how strict_low_part
is implemented we're committed to leaving that in memory.
* We don't optimize the loop and hoist the LARL of N outside the loop.
* Given that we're having to zap the mask in %r1 for the second
compare anyway, I wonder if RISBG is really beneficial over OR.
Is RISBG (or ICM for that matter) any faster (or even smaller)?
r~
Richard Henderson (2):
s390: Reorg s390_expand_insv
s390: Convert from sync to atomic optabs
gcc/config/s390/s390-protos.h | 3 +-
gcc/config/s390/s390.c | 270 ++++++++++++++++++----------
gcc/config/s390/s390.md | 401 +++++++++++++++++++++++++++++------------
3 files changed, 465 insertions(+), 209 deletions(-)
--
1.7.7.6