mac.l instructions

olegendo at gcc dot gnu.org Sun, 15 Jul 2012 05:11:53 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949


--- Comment #3 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-15 12:11:20 
UTC ---
Created attachment 27799
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27799
Proof of concept patch

This is a proof of concept patch just to probe around.
The idea is to allow the RA to allocate macl and mach registers in DImode, and
have mac insns that use the macl/mach regs as a pair in DImode.

With the patch applied, the following function ...

int64_t test01 (const int16_t* a, const int16_t* b)
{
  int64_t sum = 0;
  for (int i = 0; i < 16; ++i)
    sum += (int64_t)(*a++) * (int64_t)(*b++);
  return sum;
}

compiled with -m4 -O2 results in ...

__Z6test01PKsS0_:
.LFB0:
        .cfi_startproc
        mov    #16,r1          ! 88    movsi_ie/3    [length = 2]
        clrmac                  ! 39    clrmac/1    [length = 2]
        .align 2
.L3:
        dt    r1              ! 89    dect    [length = 2]
        bf/s    .L3             ! 90    branch_false    [length = 2]
        mac.w    @r4+,@r5+       ! 61    *macw    [length = 2]
        sts    macl,r0         ! 82    movsi_ie/8    [length = 2]
        rts                     ! 99    *return_i    [length = 2]
        sts    mach,r1         ! 83    movsi_ie/8    [length = 2]

... which is not that bad already.


Some notes I took while playing around with this:


- When compiling for big endian the RA mistakes mach and macl when
  storing mach:macl to a DImode reg:reg pair.
  This could probably be fixed by providing appropriate move insn patterns.


- Move insns/splits for DImode mach:macl <-> memory have to be added.
  I've seen an ICE when compiling with -O1:
  error: unrecognizable insn:
  (insn 122 14 15 2 (set (mem/c:DI (plus:SI (reg/f:SI 15 r15)
                (const_int 8 [0x8])) [0 %sfp+-8 S8 A32])
        (reg:DI 148 macl)) sh_mac.cpp:38 -1
     (nil))


- In some cases the mach:macl reg pair gets swapped to a general reg pair
  without any obvious need.  Example function:

  int64_t test04 (const int16_t* a, const int16_t* b,
                  const int16_t* c, const int16_t* d)
  {
    int64_t sum0 = 0;
    int64_t sum1 = 0;
    for (int i = 0; i < 16; ++i)
      sum0 += (int64_t)(*a++) * (int64_t)(*b++);

    for (int i = 0; i < 16; ++i)
      sum1 += (int64_t)(*c++) * (int64_t)(*d++);

    return sum0 - sum1;
  }

  The IRA pass first allocates sum0 and sum1 to mach:macl, but then reload
  seems to think that they are conflicting and moves sum0 to a general reg
  pair.  This results in ...

        mov     #0,r2
        mov     #16,r1
        mov     r2,r3
  .L16:
        lds     r2,macl
        lds     r3,mach
        dt      r1
        mac.w   @r4+,@r5+
        sts     macl,r2
        bf/s    .L16
        sts     mach,r3

        mov     #16,r1
        clrmac
        .align 2
  .L18:
        dt      r1
        bf/s    .L18
        mac.w   @r6+,@r7+


  which would be better as:
        mov     #16,r1
        clrmac
  .L16:
        dt      r1
        bf/s    .L16
        mac.w   @r4+,@r5+

        sts     macl,r2
        sts     mach,r3
        clrmac
        mov     #16,r1
  .L18:
        dt      r1
        bf/s    .L18
        mac.w   @r6+,@r7+


- Loops with multiple running sums like
  for (int i = 0; i < 16; ++i)
  {
    sum0 += (int64_t)(*a++) * (int64_t)(*b++);
    sum1 += (int64_t)(*c++) * (int64_t)(*d++);
  }

  result in the mach:macl pair being swapped to general reg pairs between
  subsequent mac.w instructions.  Ideally such loops should be split into
  multiple loops like in the previous example.
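
  As a source-level sketch of that splitting (the function names are
  hypothetical and not part of the patch; it only shows that both forms
  compute the same value, not what the compiler would emit):

```cpp
#include <cstdint>

// Fused form: two running sums share one loop body, so both would
// compete for the single mach:macl pair.
int64_t dot2_fused (const int16_t* a, const int16_t* b,
                    const int16_t* c, const int16_t* d)
{
  int64_t sum0 = 0;
  int64_t sum1 = 0;
  for (int i = 0; i < 16; ++i)
  {
    sum0 += (int64_t)(*a++) * (int64_t)(*b++);
    sum1 += (int64_t)(*c++) * (int64_t)(*d++);
  }
  return sum0 - sum1;
}

// Split form: one running sum per loop, so each loop has only one
// accumulator live at a time.
int64_t dot2_split (const int16_t* a, const int16_t* b,
                    const int16_t* c, const int16_t* d)
{
  int64_t sum0 = 0;
  for (int i = 0; i < 16; ++i)
    sum0 += (int64_t)(*a++) * (int64_t)(*b++);

  int64_t sum1 = 0;
  for (int i = 0; i < 16; ++i)
    sum1 += (int64_t)(*c++) * (int64_t)(*d++);

  return sum0 - sum1;
}
```

  The split is only valid here because the two sums do not depend on each
  other and the loads have no side effects.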


- When loop unrolling is turned on, the auto-inc address refs are
  converted to displacement addresses.  Because the auto-inc-dec pass
  currently fails to detect a lot of auto-inc-dec possibilities, the
  mac.w pattern will not match.
  The same goes for manually unrolled code like

  sum += (int64_t)(*a++) * (int64_t)(*b++);
  sum += (int64_t)(*a++) * (int64_t)(*b++);


- Running sum variables should be turned into DImode variables if possible:
  int32_t test00 (const int16_t* a, const int16_t* b)
  {
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i)
      sum += (*a++) * (*b++);
    return sum;
  }
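
  A hand-written sketch of what such a widening would correspond to at the
  source level (test00_wide is a hypothetical name; the patch does not do
  this transformation).  It assumes the 32-bit sum in test00 does not
  overflow, in which case both functions return the same value:

```cpp
#include <cstdint>

// Original form from above: 32-bit running sum.
int32_t test00 (const int16_t* a, const int16_t* b)
{
  int32_t sum = 0;
  for (int i = 0; i < 16; ++i)
    sum += (*a++) * (*b++);
  return sum;
}

// Widened form: the running sum is a 64-bit (DImode) variable, which
// is the shape the DImode mac.w pattern expects.  Only the low 32
// bits are returned; the narrowing is modular with GCC.
int32_t test00_wide (const int16_t* a, const int16_t* b)
{
  int64_t sum = 0;
  for (int i = 0; i < 16; ++i)
    sum += (int64_t)(*a++) * (int64_t)(*b++);
  return (int32_t)sum;
}
```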


- The existing multiplication patterns could be adapted to utilize mach:macl
  reg pair allocation, especially 32x32 -> 64 bit multiplications.


- Normal multiplications that do not need a full MAC operation but use
  memory operands can be done with a clrmac-mac sequence.
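
  Functions of the following shape would be candidates for that (the names
  are hypothetical, and whether a clrmac + mac.l / mac.w sequence would
  actually be selected for them is an assumption, not something tried with
  the patch):

```cpp
#include <cstdint>

// 32x32 -> 64 bit multiplication with both operands in memory.
// Conceptually: clear mach:macl, then one mac.l on the two addresses.
int64_t mul_mem_l (const int32_t* a, const int32_t* b)
{
  return (int64_t)(*a) * (int64_t)(*b);
}

// 16x16 -> 32 bit multiplication with both operands in memory.
// Conceptually: clear mach:macl, then one mac.w, result in macl.
int32_t mul_mem_w (const int16_t* a, const int16_t* b)
{
  return (*a) * (*b);
}
```

  Note that the mac insns post-increment both address registers, so the
  pointers would have to be dead or otherwise adjusted afterwards.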


There are probably more subtle issues.  Also, I have not tried expanding
the standard name 'maddmn4' pattern; maybe it would make some of the
problems mentioned above automagically disappear.

[Bug target/53949] [SH] Add support for mac.w / mac.l instructions
