Hi Segher,

Thanks for reviewing the patch!

> > +/* { dg-final { scan-assembler-times {\mvaddudm\M} 1 } } */
> > +/* { dg-final { scan-assembler-times {\mvadduwm\M} 1 } } */
> > +/* { dg-final { scan-assembler-times {\mvadduhm\M} 1 } } */
> > +/* { dg-final { scan-assembler-times {\mvaddubm\M} 1 } } */
> 
> You could do something like
> /* { dg-final { scan-assembler-times {\mvaddudm\M} 1 { target
> has_arch_pwr8 } } } */
> 
> I never know what exactly is wanted or needed there.  Just try it
> out?
This worked, thanks.

> As a follow-up, you could also handle muls by four (shifts by two) by
> doing two consecutive vadds, or muls by 3?  But that is an extra
> thing
> (and the mul by 4 is not so obviously an optimisation always!)

I will check the performance. I need to figure out at what point splat
and shift would do better than a sequence of vadds.
For shifts by 2 and up (mult by 2^n, n >= 2), splat and shift would be
either equal to or more in insn count.
From the perspective of max latency, 2 vaddudms would be better.
But I was wondering: in a loop the two would give similar performance,
right? The throughput of both would be the same, so the advantage would
only show up in a simple sequential basic block.
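
To make the comparison concrete, here is a rough scalar C model of the
equivalence being discussed. It is only a sketch, not code from the
patch: the function names and the two-lane layout are made up for
illustration, and each loop stands in for one lane-wise vector add such
as vaddudm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lane count: two 64-bit lanes, as in a 128-bit vector.  */
#define NLANES 2

/* Model of x << 1 done as a single lane-wise add (x + x).
   Unsigned arithmetic wraps modulo 2^64, matching a modulo add.  */
static void shl1_by_add (uint64_t *v)
{
  for (int i = 0; i < NLANES; i++)
    v[i] = v[i] + v[i];
}

/* Model of x << 2 (mul by 4) done as two dependent adds.
   The second add needs the result of the first, which is where the
   latency question comes from.  */
static void shl2_by_adds (uint64_t *v)
{
  shl1_by_add (v);
  shl1_by_add (v);
}

/* Check that the two-add form matches a plain shift by two.  */
static void check_shl2 (uint64_t x0, uint64_t x1)
{
  uint64_t v[NLANES] = { x0, x1 };
  shl2_by_adds (v);
  assert (v[0] == (x0 << 2));
  assert (v[1] == (x1 << 2));
}
```

Calling check_shl2 with any pair of values, including ones whose top
bits are shifted out, should pass. This only models the arithmetic; it
says nothing about which sequence is actually faster.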

> So, please fix the vaddudm tests, and the formatting nits.  You're
> almost there!

Sure, I have sent v7 of the patch.

Thanks and regards,
Avinash Jayakar
