vsop-479 commented on PR #12846: URL: https://github.com/apache/lucene/pull/12846#issuecomment-1842384044
> is this new approach helping cover more cases maybe? Yes, Previous patch only apply copy for level 0 's acc in ``Lucene99SkipWriter.bufferSkip``, Current patch apply copy for all level's acc in ``Lucene99SkipWriter.writeSkipData`` by adding a flag to indicates wether this is an empty acc. > when knowing when the accumulator is empty is the responsibility of the caller rather than CompetitiveFreqNormAccumulator. I will move the copy/addAll selection to caller. Before that, there are some performance data I want to share: First of all, I measured method execution time(without assertion): Method | ns | diff -- | -- | -- addAll | 48291 | - copy | 17000 | +64.7% And add the implementation to JMH to enable JIT's optimization(auto-unroll, auto-vectorization): ```` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.accAddAll 256 thrpt 15 17.311 ± 4.468 ops/us VectorUtilBenchmark.accCopy 256 thrpt 15 21.417 ± 7.009 ops/us Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.accAddAll 256 thrpt 15 19.943 ± 4.831 ops/us VectorUtilBenchmark.accCopy 256 thrpt 15 16.516 ± 1.904 ops/us ```` The speedup is not stable, so I split the implementation into array part and treeSet part. 1. Array part: ```` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.arrayCopy 256 thrpt 15 141.356 ± 3.640 ops/us VectorUtilBenchmark.arrayMax 256 thrpt 15 77.875 ± 1.310 ops/us Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.arrayCopy 256 thrpt 15 138.484 ± 5.038 ops/us VectorUtilBenchmark.arrayMax 256 thrpt 15 78.250 ± 0.992 ops/us ```` ArrayMax is the baseline that iterates the array and get the max value. ArrayCopy uses ``System.arraycopy``, which get a explicit speedup. 2. TreeSet part: ```` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.treeAddAll 256 thrpt 15 27.673 ± 19.096 ops/us VectorUtilBenchmark.treeScalarAdd 256 thrpt 15 34.888 ± 14.321 ops/us Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.treeAddAll 256 thrpt 15 25.342 ± 4.151 ops/us VectorUtilBenchmark.treeScalarAdd 256 thrpt 15 42.673 ± 19.396 ops/us ```` I also measured the method performance without JMH: Method | ns | diff -- | -- | -- treeScalarAdd | 36000 | - treeAddAll | 26875 | +25.3% The method performance indicates that `TreeSet.addAll` get a better performance, but JMH benchmark result indicates that scalarAdd(baseline) get a better performance than `TreeSet.addAll`, which is unexpected for me. I am still working on finding out the reason. Finally, I modified `acc.copy` to just copy the array but use the original iterate add for treeSet, and measured the performance: ```` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.accAddAll 256 thrpt 15 16.180 ± 3.014 ops/us VectorUtilBenchmark.accCopy 256 thrpt 15 31.807 ± 12.811 ops/us Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.accAddAll 256 thrpt 15 20.414 ± 3.693 ops/us VectorUtilBenchmark.accCopy 256 thrpt 15 24.432 ± 6.862 ops/us ```` Now `acc.copy` get a speedup. Any idea about that? @jpountz -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org