On Mon, 30 Jul 2012, Richard Henderson wrote:

> The atomic_load/storedi_1 patterns are fixed to use LM, STM.
>
> I've had a go at generating better code in the HQImode CAS
> loop for aligned memory, but I don't know that I'd call it
> the most efficient thing ever.  Some of this is due to
> deficiencies in other parts of the compiler (including the
> s390 backend):
>
> (1) MEM_ALIGN can't pass down full align+ofs data that we had
>     during cfgexpand.  This means the opportunities for using
>     the "aligned" path are less than they ought to be.
>
> (2) In get_pointer_alignment (used by get_builtin_sync_mem),
>     we don't consider an ADDR_EXPR to return the full alignment
>     that the type is due.  I'm sure this is to work around some
>     other sort of usage via the <string.h> builtins, but it's
>     less-than-handy in this case.
>
>     I wonder if in get_builtin_sync_mem we ought to be using
>     get_object_alignment (build_fold_indirect_ref (addr)) instead?
>
> Consider
>
>   struct S { int x; unsigned short y; } g_s;
>   unsigned short o, n;
>   void good() {
>     __builtin_compare_exchange (&g_s.y, &o, n, 0, 0, 0);
>   }
>   void bad(S *p_s) {
>     __builtin_compare_exchange (&p_s->y, &o, n, 0, 0, 0);
>   }
>
> where GOOD produces the aligned MEM that we need, and BAD doesn't.
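The GOOD/BAD pair shows alignment information getting lost; the related hazard discussed below is alignment information that is plain wrong, because the pointer's *type* promises more alignment than the pointed-to object actually has. A minimal standalone C illustration of that mismatch (the type and function names here are invented for the example, not taken from the patch):

```c
#include <stdint.h>

/* Illustrative only: a type that advertises 8-byte alignment.  */
typedef int aligned_int __attribute__ ((aligned (8)));

static char buf[16] __attribute__ ((aligned (8)));

/* Form a pointer whose type promises 8-byte alignment but whose value
   is only 4-byte aligned.  Nothing in practice prevents forming such a
   pointer, which is exactly why deriving MEM_ALIGN from the pointer
   type alone can give a wrong answer.  Returns the misalignment mod 8.  */
uintptr_t
misalignment (void)
{
  aligned_int *p = (aligned_int *) (buf + 4);
  return (uintptr_t) p % 8;
}
```

Here `misalignment` returns 4: the type says "8-aligned", the address says otherwise.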
You cannot generally use get_object_alignment here.  Once we have an
address in the middle-end we treat it as a generic pointer, happily not
caring about the actual pointer types.  But it seems we do

  /* The alignment needs to be at least according to that of the mode.  */
  set_mem_align (mem, MAX (GET_MODE_ALIGNMENT (mode),
                           get_pointer_alignment (loc)));

anyway?  What do we expect __builtin_compare_exchange to do for
unaligned inputs?  Like

  typedef int uint __attribute__((aligned(8)));
  unsigned short o, n;
  void very_bad (uint *p)
  {
    __builtin_compare_exchange (p, &o, n, 0, 0, 0);
  }

?  Doesn't the above set a wrong alignment?

Back to using get_object_alignment - you cannot blindly use
build_fold_indirect_ref, at least.  You could use get_object_alignment
on the operand of an ADDR_EXPR address, but I am sure we can construct
a testcase where that would give a wrong answer, too (maybe not easily
without violating the C standard's rule that pointers have to be
aligned according to their type ... but in practice nobody follows
this, and the middle-end does not require it either).

Thus, the bad news is that it's hard for the middle-end to recover the
alignment of a memory access that is represented as a builtin function
call taking addresses as parameters (which also makes the operands
address-taken and thus possibly aliased).

Didn't Andrew have some patches to introduce a GIMPLE_ATOMIC that would
eventually side-step this issue (maybe that used addresses, too)?

Richard.

> (3) Support for IC, and ICM via the insv pattern is lacking.
>     I've added a tiny bit of support here, in the form of using
>     the existing strict_low_part patterns, but most definitely we
>     could do better.
>
> (4) The *sethighpartsi and *sethighpartdi_64 patterns ought to be
>     more different.  As is, we can't insert into bits 48-56 of a
>     DImode quantity, because we don't generate ICM for DImode,
>     only ICMH.
>
> (5) Missing support for RISBGZ in the form of an extv/z expander.
>     The existing *extv/z splitters probably ought to be
>     conditionalized on !Z10.
>
> (6) The strict_low_part patterns should allow registers for at
>     least Z10.  The SImode strict_low_part can use LR everywhere.
>
> (7) RISBGZ could be used for a 3-address constant lshrsi3 before
>     srlk is available.
>
> For the GOOD function above, and this patch set, for -O3 -march=z10:
>
>         larl    %r3,s+4
>         lhrl    %r0,o
>         lhi     %r2,1
>         l       %r1,0(%r3)
>         nilh    %r1,0
> .L2:
>         lr      %r5,%r1
>         larl    %r12,n
>         lr      %r4,%r1
>         risbg   %r4,%r0,32,47,16
>         icm     %r5,3,0(%r12)
>         cs      %r4,%r5,0(%r3)
>         je      .L3
>         lr      %r5,%r4
>         nilh    %r5,0
>         cr      %r5,%r1
>         lr      %r1,%r5
>         jne     .L2
>         lhi     %r2,0
> .L3:
>         srl     %r4,16
>         sthrl   %r4,o
>
> Odd things:
>
> * O is forced into a register before reaching the expander, so we
>   get the RISBG for that.  N is left in memory, and so we commit
>   to using ICM for that.  Further, because of how strict_low_part
>   is implemented, we're committed to leaving it in memory.
>
> * We don't optimize the loop and hoist the LARL of N outside the loop.
>
> * Given that we're having to zap the mask in %r1 for the second
>   compare anyway, I wonder if RISBG is really beneficial over OR.
>   Is RISBG (or ICM for that matter) any faster (or even smaller)?
>
>
> r~
>
>
> Richard Henderson (2):
>   s390: Reorg s390_expand_insv
>   s390: Convert from sync to atomic optabs
>
>  gcc/config/s390/s390-protos.h |    3 +-
>  gcc/config/s390/s390.c        |  270 ++++++++++++++++++----------
>  gcc/config/s390/s390.md       |  401 +++++++++++++++++++++++++++++------------
>  3 files changed, 465 insertions(+), 209 deletions(-)

-- 
Richard Guenther <rguent...@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend
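For readers unfamiliar with the subword-CAS technique the generated code above implements (splice the halfword into its containing aligned word, CS the whole word, retry when only the *other* half changed), here is a rough generic C sketch. The helper name and structure are mine, not the s390 backend's; it uses the portable `__atomic` builtins, assumes a 2-byte-aligned operand, and glosses over strict-aliasing concerns for brevity:

```c
#include <stdint.h>

/* Sketch: a 16-bit compare-and-swap emulated with a 32-bit CAS on the
   containing aligned word, mirroring the HQImode CAS loop structure.
   Returns 1 on success; on failure stores the currently observed
   halfword into *EXPECTED and returns 0.  */
static int
cas_u16 (uint16_t *addr, uint16_t *expected, uint16_t desired)
{
  uintptr_t a = (uintptr_t) addr;
  uint32_t *word = (uint32_t *) (a & ~(uintptr_t) 3);  /* aligned word */
  unsigned int shift = (unsigned int) (a & 2) * 8;     /* 0 or 16 */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  shift = 16 - shift;   /* big-endian: halfword 0 is the high half */
#endif
  uint32_t mask = (uint32_t) 0xffff << shift;

  uint32_t old = __atomic_load_n (word, __ATOMIC_RELAXED);
  for (;;)
    {
      /* Splice the expected/desired halfword into the rest of the word.  */
      uint32_t cmp = (old & ~mask) | ((uint32_t) *expected << shift);
      uint32_t xchg = (old & ~mask) | ((uint32_t) desired << shift);
      if (__atomic_compare_exchange_n (word, &cmp, xchg, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
        return 1;
      /* CMP now holds the word's current contents.  */
      uint16_t cur = (uint16_t) ((cmp & mask) >> shift);
      if (cur != *expected)
        {
          *expected = cur;  /* our halfword really differed: failure */
          return 0;
        }
      old = cmp;            /* only the other halfword changed: retry */
    }
}
```

The final `old = cmp; retry` branch corresponds to the `nilh`/`cr`/`jne .L2` sequence in the assembly: a CS failure caused solely by the neighboring halfword must not be reported to the caller.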