Hi Segher,

Thanks for the analysis!


On 2020-02-17 02:12, Segher Boessenkool wrote:
Hi!

On Sun, Feb 16, 2020 at 09:52:12PM +0100, Marcus Geelnard wrote:
   (define_insn "smulhshi3"
     [(set (match_operand:HI 0 "register_operand" "=r")
       (truncate:HI
         (ashiftrt:SI
           (mult:SI
             (sign_extend:SI (match_operand:HI 1 "register_operand" "r"))
             (sign_extend:SI (match_operand:HI 2 "register_operand" "r")))
           (const_int 15))))]
   "TARGET_PACKED_OPS"
   "mulq.h\\t%0, %1, %2")
insn_cost 4 for    21: r86:SI=s1:SI
       REG_DEAD s1:SI
insn_cost 4 for     2: r78:SI=r86:SI
       REG_DEAD r86:SI
insn_cost 4 for    22: r87:SI=s2:SI
       REG_DEAD s2:SI
insn_cost 4 for     4: r80:SI=r87:SI
       REG_DEAD r87:SI
insn_cost 4 for     9: r82:SI=sign_extend(r78:SI#0)
       REG_DEAD r78:SI
insn_cost 4 for    10: r83:SI=sign_extend(r80:SI#0)
       REG_DEAD r80:SI
insn_cost 20 for    11: r84:SI=r82:SI*r83:SI
       REG_DEAD r83:SI
       REG_DEAD r82:SI
insn_cost 8 for    12: r85:SI=r84:SI>>0xf
       REG_DEAD r84:SI
insn_cost 4 for    18: s1:HI=r85:SI#0
       REG_DEAD r85:SI
insn_cost 0 for    19: use s1:HI
21 and 22 remain like this, so that register allocation can use whatever
registers work best here.

Trying 2 -> 9:
Successfully matched this instruction:
Trying 4 -> 10:
Successfully matched this instruction:
As expected here.

Trying 9 -> 11:
Failed to match this instruction:
(set (reg:SI 84)
     (mult:SI (sign_extend:SI (subreg:HI (reg:SI 86) 0))
         (reg:SI 83 [ op2D.1381 ])))

Trying 10 -> 11:
Failed to match this instruction:
(set (reg:SI 84)
     (mult:SI (sign_extend:SI (subreg:HI (reg:SI 87) 0))
         (reg:SI 82 [ op1D.1380 ])))
Neither sign_extend combines with the mult on its own.

Trying 10, 9 -> 11:
Failed to match this instruction:
(set (reg:SI 84)
     (mult:SI (sign_extend:SI (subreg:HI (reg:SI 87) 0))
         (sign_extend:SI (subreg:HI (reg:SI 86) 0))))
And neither do both together.  Do you have an instruction that can do
this?  How expensive is it?


Unfortunately I don't have an instruction specifically for mult:SI (sign_extend:SI (HI)) (sign_extend:SI (HI)).



Trying 11 -> 12:
    11: r84:SI=r82:SI*r83:SI
       REG_DEAD r83:SI
       REG_DEAD r82:SI
    12: r85:SI=r84:SI>>0xf
       REG_DEAD r84:SI
Failed to match this instruction:
(set (reg:SI 85)
     (ashiftrt:SI (mult:SI (reg:SI 82 [ op1D.1380 ])
             (reg:SI 83 [ op2D.1381 ]))
         (const_int 15 [0xf])))
Good ;-)

Trying 9, 11 -> 12:
Failed to match this instruction:
(set (reg:SI 85)
     (ashiftrt:SI (mult:SI (sign_extend:SI (subreg:HI (reg:SI 86) 0))
             (reg:SI 83 [ op2D.1381 ]))
         (const_int 15 [0xf])))
Trying 10, 11 -> 12:
Failed to match this instruction:
(set (reg:SI 85)
     (ashiftrt:SI (mult:SI (sign_extend:SI (subreg:HI (reg:SI 87) 0))
             (reg:SI 82 [ op1D.1380 ]))
         (const_int 15 [0xf])))
Neither of those work, that is fine as well.

But, 9+10+11+12 is not tried.  See
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/combine.c;h=d44b9c3bf950e52dce5089f7297a9fc7fb28dcfd;hb=HEAD#l2728
for why we don't here: most 4-insn combinations (where no 3-insn subset
of it leads to a valid insn) do not go anywhere, and there are *many*
such combinations that can be tried.  Instructions with a binary op with
a constant source count as one "ngood", and instructions that are just a
move from a constant into something count as two "ngood".  Shifts are
counted separately, in "nshift".  So here we get ngood=0 and nshift=1,
but we need one of them to be at least 2 to try a 4-insn combo.

Maybe we should have an "nextend" as well?  Many targets use something
like this for widening multiplies.  I'll try something like that, but
it won't make it into GCC 10 (we're in development stage 4 for it now).

In the meantime, you can add a pattern for the result of 9+10+11:

(set (match_operand:SI ...)
      (mult:SI (sign_extend:SI (match_operand:HI ...))
               (sign_extend:SI (match_operand:HI ...))))

(which you then have to handle, of course, either with a machine insn
if that exists, or some other way, a libcall perhaps; you already have
some way to do mulsi3 I guess?)


Yes, I have mulsi3, but the thing is that I have instructions that exactly match the definition of smulhsqi3/smulhshi3/smulhssi3. They're called MULQ.B, MULQ.H and MULQ (for Q-format fixed point), and they typically have a throughput of 1 operation / cycle. See [1]. I'd really like to find a way to tell gcc to emit those instructions.

If naive C code does not generate a matching pattern (it would be nice if it did, though), is there something else that can be used (e.g. a builtin)?

Again, thanks! I'll dig further.

/Marcus


[1] https://github.com/mrisc32/mrisc32/blob/master/doc/Instructions.md#multiply-and-divide-instructions


Reply via email to