http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #3 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-15 12:11:20 UTC ---
Created attachment 27799
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27799
Proof of concept patch

This is a proof of concept patch just to probe around.
The idea is to allow the RA to allocate macl and mach registers in DImode, and
have mac insns that use the macl/mach regs as a pair in DImode.

With the patch applied, the following function ...

int64_t test01 (const int16_t* a, const int16_t* b)
{
  int64_t sum = 0;
  for (int i = 0; i < 16; ++i)
    sum += (int64_t)(*a++) * (int64_t)(*b++);
  return sum;
}

compiled with -m4 -O2 results in ...

__Z6test01PKsS0_:
.LFB0:
        .cfi_startproc
        mov     #16,r1          ! 88    movsi_ie/3      [length = 2]
        clrmac                  ! 39    clrmac/1        [length = 2]
        .align 2
.L3:
        dt      r1              ! 89    dect            [length = 2]
        bf/s    .L3             ! 90    branch_false    [length = 2]
        mac.w   @r4+,@r5+       ! 61    *macw           [length = 2]
        sts     macl,r0         ! 82    movsi_ie/8      [length = 2]
        rts                     ! 99    *return_i       [length = 2]
        sts     mach,r1         ! 83    movsi_ie/8      [length = 2]

... which is not that bad already.

Some notes I took while playing around with this:

- When compiling for big endian the RA mistakes mach and macl when storing
  mach:macl to a DImode reg:reg pair.  This could probably be fixed by
  providing appropriate move insn patterns.

- Move insns/splits for DImode mach:macl <-> memory have to be added.
  I've seen an ICE when compiling with -O1:

  error: unrecognizable insn:
  (insn 122 14 15 2 (set (mem/c:DI (plus:SI (reg/f:SI 15 r15)
                  (const_int 8 [0x8])) [0 %sfp+-8 S8 A32])
          (reg:DI 148 macl)) sh_mac.cpp:38 -1
       (nil))

- In some cases the mach:macl reg pair gets swapped to a general reg pair
  without any obvious need.
  Example function:

  int64_t test04 (const int16_t* a, const int16_t* b,
                  const int16_t* c, const int16_t* d)
  {
    int64_t sum0 = 0;
    int64_t sum1 = 0;

    for (int i = 0; i < 16; ++i)
      sum0 += (int64_t)(*a++) * (int64_t)(*b++);

    for (int i = 0; i < 16; ++i)
      sum1 += (int64_t)(*c++) * (int64_t)(*d++);

    return sum0 - sum1;
  }

  The IRA pass first allocates sum0 and sum1 to mach:macl, but then reload
  seems to think that they are conflicting and moves sum0 to a general reg
  pair.  This results in ...

        mov     #0,r2
        mov     #16,r1
        mov     r2,r3
.L16:
        lds     r2,macl
        lds     r3,mach
        dt      r1
        mac.w   @r4+,@r5+
        sts     macl,r2
        bf/s    .L16
        sts     mach,r3

        mov     #16,r1
        clrmac
        .align 2
.L18:
        dt      r1
        bf/s    .L18
        mac.w   @r6+,@r7+

  which would be better as:

        mov     #16,r1
        clrmac
.L16:
        dt      r1
        bf/s    .L16
        mac.w   @r4+,@r5+

        sts     macl,r2
        sts     mach,r3
        clrmac
        mov     #16,r1
.L18:
        dt      r1
        bf/s    .L18
        mac.w   @r6+,@r7+

- Loops with multiple running sums like

  for (int i = 0; i < 16; ++i)
  {
    sum0 += (int64_t)(*a++) * (int64_t)(*b++);
    sum1 += (int64_t)(*c++) * (int64_t)(*d++);
  }

  result in macl:mach swapping to general reg pairs between subsequent mac.w
  instructions.  Ideally such loops should be split into multiple loops like
  in the previous example.

- When loop unrolling is turned on, the auto-inc address refs are converted
  to displacement addresses.  Because the auto-inc-dec pass currently fails
  to detect a lot of auto-inc-dec possibilities, the mac.w pattern will not
  match.  The same goes for manually unrolled code like

  sum += (int64_t)(*a++) * (int64_t)(*b++);
  sum += (int64_t)(*a++) * (int64_t)(*b++);

- Running sum variables should be turned into DImode variables if possible:

  int32_t test00 (const int16_t* a, const int16_t* b)
  {
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i)
      sum += (*a++) * (*b++);
    return sum;
  }

- The existing multiplication patterns could be adapted to utilize macl:mach
  reg pair allocation, especially 32x32 -> 64 bit multiplications.
- Normal multiplications that do not need a full MAC operation but use memory
  operands can be done with a clrmac-mac sequence.

Probably there are more subtle issues.  Also, I have not tried expanding the
standard name 'maddmn4' pattern, maybe it would make some of the problems
mentioned above automagically disappear.
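
For reference, the loop splitting suggested in the note about multiple running
sums can be written out by hand at the source level.  This is only a sketch to
show that the two forms compute the same value; the function names
dot_diff_fused and dot_diff_split are made up for illustration and are not
part of the patch, and I have not checked what code the patched compiler
generates for either of them.

```cpp
#include <cstdint>

// Fused form from the notes: two running sums updated in the same loop.
// Hypothetical name, for illustration only.
int64_t dot_diff_fused (const int16_t* a, const int16_t* b,
                        const int16_t* c, const int16_t* d)
{
  int64_t sum0 = 0;
  int64_t sum1 = 0;
  for (int i = 0; i < 16; ++i)
    {
      sum0 += (int64_t)(*a++) * (int64_t)(*b++);
      sum1 += (int64_t)(*c++) * (int64_t)(*d++);
    }
  return sum0 - sum1;
}

// Manually split form (same shape as test04): one running sum per loop, so
// that only a single 64 bit accumulator is live inside each loop body.
// Hypothetical name, for illustration only.
int64_t dot_diff_split (const int16_t* a, const int16_t* b,
                        const int16_t* c, const int16_t* d)
{
  int64_t sum0 = 0;
  for (int i = 0; i < 16; ++i)
    sum0 += (int64_t)(*a++) * (int64_t)(*b++);

  int64_t sum1 = 0;
  for (int i = 0; i < 16; ++i)
    sum1 += (int64_t)(*c++) * (int64_t)(*d++);

  return sum0 - sum1;
}
```

Both functions read 16 elements from each pointer and the two sums do not
depend on each other, so splitting the loop does not change the result.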