On Mon, Feb 07, 2022 at 10:06:36PM -0600, Bill Schmidt wrote:
> On 2/7/22 5:05 PM, Segher Boessenkool wrote:
> > On Mon, Feb 07, 2022 at 04:20:24PM -0600, Bill Schmidt wrote:
> >> I observed recently that a couple of Power10 instructions and built-in
> >> functions were somehow not implemented.  This patch adds one of them
> >> (vmsumcud).  Although this isn't normally stage-4 material, this is
> >> really simple and carries no discernible risk, so I hope it can be
> >> considered.
> > But what is the advantage?  That will be very tiny as well, afaics?
> >
> > Ah, this implements a builtin as well.  But that builtin is not in the
> > PVIPR, so no one yet uses it most likely?
>
> It's in the yet unpublished version of PVIPR that adds ISA 3.1 support,
> currently awaiting public review.  It should have been implemented with
> the rest of the ISA 3.1 built-ins.  (There are two more that were missed
> as well, which I haven't yet addressed.)

Ugh.  Too much process, not enough speed.

> >> +;; vmsumcud
> >> +(define_insn "vmsumcud"
> >> +[(set (match_operand:V1TI 0 "register_operand" "+v")
> >> +      (unspec:V1TI [(match_operand:V2DI 1 "register_operand" "v")
> >> +                    (match_operand:V2DI 2 "register_operand" "v")
> >> +                    (match_operand:V1TI 3 "register_operand" "v")]
> >> +                   UNSPEC_VMSUMCUD))]
> >> +  "TARGET_POWER10"
> >> +  "vmsumcud %0,%1,%2,%3"
> >> +  [(set_attr "type" "vecsimple")]
> >> +)
> > This can be properly described in RTL instead of using an unspec.  This
> > is much preferable.  I would say compare to maddhd[u], but those insns
> > aren't implemented either (maddld is though).
>
> Is it?  Note that vmsumcud produces the carry out of the final
> result, not the result itself.  I couldn't immediately see how
> to express this in RTL.

It produces the top 128 bits of the (infinitely precise) result.  But yeah
that requires an OImode here (for the temp itself), and we do not have that
in the backend yet.

> The full operation multiplies the corresponding lanes of each
> doubleword of arguments 1 and 2, adds them together with the
> 128-bit value in argument 3, and produces the carry out of the
> result as a 128-bit value in the result.  I think I'd need to
> have a 256-bit mode to express this properly in RTL, right?

Not if you actually calculate the carry, instead of computing the 256-bit
result and truncating it.  But this is very unwieldy (it would be fine if
adding just two datums, but here there are three).

Should the type be vecsimple?  Don't we have a type for multiplications?
Hrm it looks like we use veccomplex usually.

Okay for trunk with that taken care of.  Thanks!


Segher
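
P.S. To make the carry discussion above concrete, here is a minimal Python
sketch of the operation as it is described in this thread (not taken from the
ISA text or from GCC).  The function names vmsumcud_model and
carry_without_wide_temp are hypothetical and exist only for illustration.
The first function computes the exact wide sum and keeps its top 128 bits; the
second derives the same value from two 128-bit wrap-around additions, which is
roughly the "actually calculate the carry" approach and shows why three
addends make it clumsier than two.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def vmsumcud_model(a, b, c):
    """Reference model, assuming the semantics described in this thread.

    a, b: pairs of unsigned 64-bit doublewords; c: unsigned 128-bit value.
    Returns the top 128 bits of the exact (unbounded) multiply-sum.
    """
    assert all(0 <= x <= MASK64 for x in (*a, *b))
    assert 0 <= c <= MASK128
    exact = a[0] * b[0] + a[1] * b[1] + c  # needs more than 128 bits
    return exact >> 128


def carry_without_wide_temp(a, b, c):
    """Same result using only 128-bit wrap-around additions.

    Each addition contributes at most one carry, detected by checking
    whether the truncated sum wrapped below one of its inputs.
    """
    p0 = a[0] * b[0]  # each product fits in 128 bits
    p1 = a[1] * b[1]
    s1 = (p0 + p1) & MASK128
    carry1 = 1 if s1 < p0 else 0
    s2 = (s1 + c) & MASK128
    carry2 = 1 if s2 < s1 else 0
    return carry1 + carry2
```

With two addends a single comparison would do; the third addend is what
forces the second compare-and-accumulate step, and an RTL pattern would have
to spell out each of those steps explicitly.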