gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2725390772
> There must be something that happens with this 512 step that doesn't happen otherwise such as using different instructions, loop unrolling, better CPU pipelining or something else.

Thanks for pointing this out. I studied the asm profile again, and I can see that at least loop unrolling differs. According to the asm printed by JMH, for bpv24 decoding:

* VectorAPI: the shift loop is unrolled x8 (one `add 0x40` per iteration) and the remainder loop x4 (one `add 0x20` per iteration).
* InnerLoop step-512: the shift loop is unrolled x4 (one `add 0x20` per iteration) and the remainder loop x2 (one `add 0x10` per iteration).
* InnerLoop step-128: neither the shift loop nor the remainder loop is unrolled (one `add 0x8` per iteration in both).

This matches the JMH result: Vector API > InnerLoop step-512 > InnerLoop step-128.

Things may differ in luceneutil, where we found InnerLoop step-512 to be faster than the Vector API. I confirmed this with a luceneutil run of step-512 (baseline) vs. step-128 (candidate):

```
Task                 QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff                p-value
FilteredIntNRQ              80.02  (4.0%)                    71.31  (3.0%)  -10.9% ( -17% -  -4%)     0.000
IntNRQ                      80.94  (2.5%)                    72.60  (3.6%)  -10.3% ( -16% -  -4%)     0.000
CountFilteredIntNRQ         42.93  (2.9%)                    40.22  (2.3%)   -6.3% ( -11% -  -1%)     0.001
IntSet                      93.36  (2.1%)                    93.85  (0.7%)    0.5% (  -2% -   3%)     0.633
```
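For readers unfamiliar with the two loops named above, here is a minimal sketch of what a "shift loop" and a "remainder loop" can look like for 24-bit values. The layout is an assumption made for illustration (four 24-bit values packed into three ints, with the fourth value's bytes spread over the low byte of three ints), and the class `Bpv24Sketch` and its methods are hypothetical names, not the code from the PR. Both loops are plain counted loops over int arrays, which is the shape the JIT can unroll and auto-vectorize, though whether and how far it does so depends on the JVM and the loop bounds.

```java
/** Illustrative only: an assumed 24-bit packing, not Lucene's actual implementation. */
final class Bpv24Sketch {
  private Bpv24Sketch() {}

  /**
   * Packs count values (each in [0, 2^24), count a multiple of 4) into 3 * count / 4 ints.
   * The first three quarters go into the high 24 bits; the last quarter is split into bytes.
   */
  static void encode(int[] values, int count, int[] packed) {
    int quarter = count / 4;
    int threeQuarters = 3 * quarter;
    for (int i = 0; i < threeQuarters; i++) {
      packed[i] = values[i] << 8;
    }
    for (int i = 0; i < quarter; i++) {
      int v = values[threeQuarters + i];
      packed[i] |= (v >>> 16) & 0xFF;
      packed[i + quarter] |= (v >>> 8) & 0xFF;
      packed[i + 2 * quarter] |= v & 0xFF;
    }
  }

  /** Inverse of encode: one shift loop, then one remainder loop. */
  static void decode(int[] packed, int count, int[] out) {
    int quarter = count / 4;
    int threeQuarters = 3 * quarter;
    // Shift loop: recover the first three quarters from the high 24 bits.
    for (int i = 0; i < threeQuarters; i++) {
      out[i] = packed[i] >>> 8;
    }
    // Remainder loop: rebuild the last quarter from the low bytes of three ints.
    for (int i = 0; i < quarter; i++) {
      out[threeQuarters + i] =
          ((packed[i] & 0xFF) << 16)
              | ((packed[i + quarter] & 0xFF) << 8)
              | (packed[i + 2 * quarter] & 0xFF);
    }
  }
}
```

Calling `decode` with `count = 512` or `count = 128` gives loops with different trip counts, which is one way to reproduce the step-512 vs. step-128 comparison in a JMH benchmark and inspect the generated assembly with a profiler such as `perfasm`.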