> 
> The latter takes 5 cycles, the former takes 3 cycles.

It's pipelined, however, so the higher latency does not reduce throughput.

> 
> Do you have any microbenchmark or real workloads to show your
> optimization is better?

Keep in mind it only uses one port vs two.

Yes, I ran it on Arrow Lake and saw wins on both P-cores and E-cores
according to the throughput numbers. In fact, the E-cores saw bigger wins
even though they have higher latency (which surprised me).

Another advantage is that it uses fewer registers, so more unrolling is possible.

It might be reasonable to tweak the costs per CPU, though; I haven't
done that.

BTW for rotate the wins are much higher because there are no native
instructions for it.
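
For illustration only (this is not code from the patch): a minimal C
sketch, assuming the rotate in question is a per-byte rotate, of the
generic fallback when no rotate instruction is available. The names
rotl8 and rotl8_array are hypothetical. Each rotate costs two shifts
plus an OR and needs a temporary for the second shift result, which is
roughly why replacing it with a single operation would pay off more
than for a plain shift.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rotate one byte left by n bits.
   Without a native rotate this takes two shifts and an OR. */
static inline uint8_t rotl8(uint8_t x, unsigned n)
{
    n &= 7;  /* rotating a byte by 8 is the identity */
    return (uint8_t)((x << n) | (x >> ((8 - n) & 7)));
}

/* Hypothetical loop of the kind a vectorizer would see:
   every element is rotated by the same count. */
static void rotl8_array(uint8_t *dst, const uint8_t *src,
                        size_t len, unsigned n)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = rotl8(src[i], n);
}
```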

-Andi
