On Tue, 7 Oct 2025 02:47:50 GMT, Xiaohong Gong <[email protected]> wrote:

>> Are you referring to the N1 numbers? The add reduction operation shows 
>> gains of around 40%, while the mul reduction is around 20% on N1. On V1 and 
>> V2 they look comparable (not counting the cases where we generate `fadda` 
>> instructions for the add reduction).
>> 
>>> Seems instructions between different ins instructions will have a 
>>> data-dependence, which is not expected
>> 
>> Why do you think it's not expected? We have the exact same sequence for the 
>> Neon add reduction as well. There is a back-to-back dependency there too, 
>> and yet it shows better performance. The N1 optimization guide lists a 
>> 2-cycle latency for `fadd` and a 3-cycle latency for `fmul`. Could this be 
>> the reason? WDYT?
>
> I mean we do not expect a data dependence between two `ins` operations, but 
> there is one now. We do not recommend using instructions that write only 
> part of a register, as this can introduce unexpected dependences between 
> them. I suggest using `ext` instead; with it I observe about a 20% 
> performance improvement over the current version on V2. I have not checked 
> the correctness, but it looks right to me. Could you please help check on 
> other machines? Thanks!
> 
> The change might look like:
> Suggestion:
> 
>         fmulh(dst, fsrc, vsrc);
>         ext(vtmp, T8B, vsrc, vsrc, 2);
>         fmulh(dst, dst, vtmp);
>         ext(vtmp, T8B, vsrc, vsrc, 4);
>         fmulh(dst, dst, vtmp);
>         ext(vtmp, T8B, vsrc, vsrc, 6);
>         fmulh(dst, dst, vtmp);
>         if (isQ) {
>           ext(vtmp, T16B, vsrc, vsrc, 8);
>           fmulh(dst, dst, vtmp);
>           ext(vtmp, T16B, vsrc, vsrc, 10);
>           fmulh(dst, dst, vtmp);
>           ext(vtmp, T16B, vsrc, vsrc, 12);
>           fmulh(dst, dst, vtmp);
>           ext(vtmp, T16B, vsrc, vsrc, 14);
>           fmulh(dst, dst, vtmp);
>         }

Hi @XiaohongGong Thanks for this suggestion. I understand that `ins` has a 
read-modify-write dependency, while `ext` does not, since we are not reading 
the `vtmp` register in that case.
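To spell out the difference, here is a rough, illustrative sketch in the same style as the suggestion above (the exact `ins` helper signature is approximate, not taken from the patch): `ins` writes only one lane of its destination, so architecturally it must merge with the previous value of `vtmp`, chaining each `ins` to the one before it, whereas `ext` fully writes `vtmp` from `vsrc` and depends only on `vsrc`.

        // ins-based: each ins merges into vtmp, so it reads vtmp's
        // previous value -> the lane extractions form a serial chain
        ins(vtmp, H, vsrc, 0, 1);  // read-modify-write of vtmp
        fmulh(dst, dst, vtmp);
        ins(vtmp, H, vsrc, 0, 2);  // must wait for the prior vtmp write
        fmulh(dst, dst, vtmp);

        // ext-based: each ext fully overwrites vtmp from vsrc, so the
        // extractions are independent of each other
        ext(vtmp, T8B, vsrc, vsrc, 2);
        fmulh(dst, dst, vtmp);
        ext(vtmp, T8B, vsrc, vsrc, 4);
        fmulh(dst, dst, vtmp);

The `fmulh` accumulation still serializes through `dst` in both versions; the `ext` form just keeps the lane extraction off that critical path.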

I made changes to both the add and mul reduction implementations and I can see 
some perf gains on V1 and V2 for the mul reduction:

Benchmark | vectorDim | 8B | 16B
-- | -- | -- | --
Float16OperationsBenchmark.ReductionAddFP16 | 256 | 1.0022509 | 0.99938584
Float16OperationsBenchmark.ReductionAddFP16 | 512 | 1.05157946 | 1.00262025
Float16OperationsBenchmark.ReductionAddFP16 | 1024 | 1.02392196 | 1.00187924
Float16OperationsBenchmark.ReductionAddFP16 | 2048 | 1.01219315 | 0.99964493
Float16OperationsBenchmark.ReductionMulFP16 | 256 | 0.99729809 | 1.19006546
Float16OperationsBenchmark.ReductionMulFP16 | 512 | 1.03897347 | 1.0689105
Float16OperationsBenchmark.ReductionMulFP16 | 1024 | 1.01822982 | 1.01509971
Float16OperationsBenchmark.ReductionMulFP16 | 2048 | 1.0086255 | 1.0032434


-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2614674991
