On Wed, Aug 13, 2025 at 1:40 AM Andi Kleen <a...@firstfloor.org> wrote:
>
> >
> > The latter takes 5 cycles, the former takes 3 cycles.
>
> It's pipelined however.
>
> >
> > Do you have any microbenchmark or real workloads to show your
> > optimization is better?
>
> Keep in mind it only uses one port vs two.
>
> Yes I ran it on Arrow lake and saw wins on both Pcore and Ecore
> according to the throughputs.  In fact Ecore saw higher wins
> even though it has higher latency (it was surprising to me)
>
> Another advantage is using less registers so more unrolling is possible.
>
> It might be reasonable to tweak the costs per CPU however, I haven't
> done that.
>
> BTW for rotate the wins are much higher because there are no native
> instructions for it.
For ashl/lshr, the original implementation takes only 2 instructions
(vpsllw/vpsrlw + vpand), and for ashr with a shift count of 7 it takes
only 1 instruction (vpcmpgtb), e.g.
https://godbolt.org/z/Wef97YqGx
So I'd like to keep the original implementation for them.
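
As a rough illustration of why those sequences are so short, here is a
scalar C model of the idea. It is only a sketch, not the expander code:
the function names are made up for this example, and each 16-bit lane is
modeled with plain integer arithmetic instead of vector instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise left shift of 16 bytes, modeled as a 16-bit lane shift
   (the vpsllw step) followed by a per-byte mask (the vpand step).
   n must be in 0..7.  Hypothetical helper, for illustration only.  */
void bytes_shl(uint8_t v[16], unsigned n)
{
    uint8_t mask = (uint8_t)(0xFFu << n);   /* clears bits leaked from the lower byte */
    for (size_t i = 0; i < 16; i += 2) {
        /* Assemble one little-endian 16-bit lane.  */
        uint16_t w = (uint16_t)(v[i] | (v[i + 1] << 8));
        w = (uint16_t)(w << n);             /* word shift crosses the byte boundary */
        v[i]     = (uint8_t)(w & 0xFF) & mask;
        v[i + 1] = (uint8_t)(w >> 8) & mask;
    }
}

/* Byte-wise logical right shift, same pattern (vpsrlw + vpand).  */
void bytes_lshr(uint8_t v[16], unsigned n)
{
    uint8_t mask = (uint8_t)(0xFFu >> n);   /* clears bits leaked from the upper byte */
    for (size_t i = 0; i < 16; i += 2) {
        uint16_t w = (uint16_t)(v[i] | (v[i + 1] << 8));
        w = (uint16_t)(w >> n);
        v[i]     = (uint8_t)(w & 0xFF) & mask;
        v[i + 1] = (uint8_t)(w >> 8) & mask;
    }
}

/* Arithmetic right shift by 7 leaves only copies of the sign bit, so it
   equals the signed compare "0 > x" (the vpcmpgtb step), which yields
   all-ones for negative bytes and zero otherwise.  */
uint8_t byte_ashr7(uint8_t x)
{
    int signed_val = x >= 128 ? (int)x - 256 : (int)x;  /* portable sign extension */
    return 0 > signed_val ? 0xFF : 0x00;
}
```

The mask is what repairs the bits that the word-sized shift drags across
the byte boundary inside each lane, which is why one shift plus one AND
is enough for ashl/lshr.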

For ashr (with shift count != 7) and rotate, I agree with your point.
>
> -Andi
>


-- 
BR,
Hongtao
