mikemccand commented on PR #15341: URL: https://github.com/apache/lucene/pull/15341#issuecomment-3422688417
I tested on [`beast3` (nightly benchmarking box)](https://blog.mikemccandless.com/2021/01/apache-lucene-performance-on-128-core.html) -- a Ryzen Threadripper 3990X:

```
processor       : 127
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD Ryzen Threadripper 3990X 64-Core Processor
stepping        : 0
microcode       : 0x830107c
cpu MHz         : 2900.000
cache size      : 512 KB
physical id     : 0
siblings        : 128
core id         : 63
cpu cores       : 64
apicid          : 127
initial apicid  : 127
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret spectre_v2_user
bogomips        : 5788.93
TLB size        : 3072 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
```

I applied this PR, built (`./gradlew :lucene:benchmark-jmh:assemble`), and ran `java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh VectorScorerBenchmark -p size=256`, and got:

```
Benchmark                                      (padBytes)  (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductDefault           0     256  thrpt   15   8.601 ± 0.047  ops/us
VectorScorerBenchmark.binaryDotProductDefault           1     256  thrpt   15   8.573 ± 0.058  ops/us
VectorScorerBenchmark.binaryDotProductDefault           2     256  thrpt   15   8.593 ± 0.023  ops/us
VectorScorerBenchmark.binaryDotProductDefault           4     256  thrpt   15   8.579 ± 0.026  ops/us
VectorScorerBenchmark.binaryDotProductDefault           6     256  thrpt   15   8.584 ± 0.026  ops/us
VectorScorerBenchmark.binaryDotProductDefault           8     256  thrpt   15   8.605 ± 0.019  ops/us
VectorScorerBenchmark.binaryDotProductDefault          16     256  thrpt   15   8.603 ± 0.034  ops/us
VectorScorerBenchmark.binaryDotProductDefault          20     256  thrpt   15   8.583 ± 0.031  ops/us
VectorScorerBenchmark.binaryDotProductDefault          32     256  thrpt   15   8.581 ± 0.030  ops/us
VectorScorerBenchmark.binaryDotProductDefault          50     256  thrpt   15   8.591 ± 0.052  ops/us
VectorScorerBenchmark.binaryDotProductDefault          64     256  thrpt   15   8.611 ± 0.033  ops/us
VectorScorerBenchmark.binaryDotProductDefault         100     256  thrpt   15   8.594 ± 0.051  ops/us
VectorScorerBenchmark.binaryDotProductDefault         128     256  thrpt   15   8.620 ± 0.032  ops/us
VectorScorerBenchmark.binaryDotProductDefault         255     256  thrpt   15   8.597 ± 0.026  ops/us
VectorScorerBenchmark.binaryDotProductDefault         256     256  thrpt   15   8.605 ± 0.056  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg            0     256  thrpt   15  25.203 ± 1.850  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg            1     256  thrpt   15  25.961 ± 0.047  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg            2     256  thrpt   15  25.314 ± 1.959  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg            4     256  thrpt   15  25.958 ± 0.067  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg            6     256  thrpt   15  25.295 ± 1.977  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg            8     256  thrpt   15  26.122 ± 0.073  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg           16     256  thrpt   15  26.056 ± 0.184  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg           20     256  thrpt   15  25.848 ± 1.589  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg           32     256  thrpt   15  25.817 ± 0.417  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg           50     256  thrpt   15  26.065 ± 0.585  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg           64     256  thrpt   15  26.045 ± 0.162  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg          100     256  thrpt   15  26.093 ± 0.061  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg          128     256  thrpt   15  26.101 ± 0.090  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg          255     256  thrpt   15  26.028 ± 0.088  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg          256     256  thrpt   15  26.027 ± 0.301  ops/us
VectorScorerBenchmark.floatDotProductDefault            0     256  thrpt   15  15.241 ± 0.010  ops/us
VectorScorerBenchmark.floatDotProductDefault            1     256  thrpt   15  15.169 ± 0.232  ops/us
VectorScorerBenchmark.floatDotProductDefault            2     256  thrpt   15  15.230 ± 0.082  ops/us
VectorScorerBenchmark.floatDotProductDefault            4     256  thrpt   15  15.231 ± 0.034  ops/us
VectorScorerBenchmark.floatDotProductDefault            6     256  thrpt   15  15.229 ± 0.048  ops/us
VectorScorerBenchmark.floatDotProductDefault            8     256  thrpt   15  15.216 ± 0.091  ops/us
VectorScorerBenchmark.floatDotProductDefault           16     256  thrpt   15  15.278 ± 0.048  ops/us
VectorScorerBenchmark.floatDotProductDefault           20     256  thrpt   15  15.058 ± 0.711  ops/us
VectorScorerBenchmark.floatDotProductDefault           32     256  thrpt   15  15.192 ± 0.100  ops/us
VectorScorerBenchmark.floatDotProductDefault           50     256  thrpt   15  15.300 ± 0.047  ops/us
VectorScorerBenchmark.floatDotProductDefault           64     256  thrpt   15  15.257 ± 0.083  ops/us
VectorScorerBenchmark.floatDotProductDefault          100     256  thrpt   15  15.272 ± 0.038  ops/us
VectorScorerBenchmark.floatDotProductDefault          128     256  thrpt   15  15.144 ± 0.529  ops/us
VectorScorerBenchmark.floatDotProductDefault          255     256  thrpt   15  15.248 ± 0.024  ops/us
VectorScorerBenchmark.floatDotProductDefault          256     256  thrpt   15  15.276 ± 0.039  ops/us
VectorScorerBenchmark.floatDotProductMemSeg             0     256  thrpt   15  20.360 ± 0.077  ops/us
VectorScorerBenchmark.floatDotProductMemSeg             1     256  thrpt   15  20.252 ± 0.177  ops/us
VectorScorerBenchmark.floatDotProductMemSeg             2     256  thrpt   15  20.281 ± 0.060  ops/us
VectorScorerBenchmark.floatDotProductMemSeg             4     256  thrpt   15  20.261 ± 0.048  ops/us
VectorScorerBenchmark.floatDotProductMemSeg             6     256  thrpt   15  20.285 ± 0.063  ops/us
VectorScorerBenchmark.floatDotProductMemSeg             8     256  thrpt   15  20.359 ± 0.072  ops/us
VectorScorerBenchmark.floatDotProductMemSeg            16     256  thrpt   15  20.344 ± 0.078  ops/us
VectorScorerBenchmark.floatDotProductMemSeg            20     256  thrpt   15  20.272 ± 0.090  ops/us
VectorScorerBenchmark.floatDotProductMemSeg            32     256  thrpt   15  20.413 ± 0.010  ops/us
VectorScorerBenchmark.floatDotProductMemSeg            50     256  thrpt   15  20.066 ± 0.051  ops/us
VectorScorerBenchmark.floatDotProductMemSeg            64     256  thrpt   15  20.386 ± 0.051  ops/us
VectorScorerBenchmark.floatDotProductMemSeg           100     256  thrpt   15  20.029 ± 0.095  ops/us
VectorScorerBenchmark.floatDotProductMemSeg           128     256  thrpt   15  20.348 ± 0.049  ops/us
VectorScorerBenchmark.floatDotProductMemSeg           255     256  thrpt   15  20.047 ± 0.101  ops/us
VectorScorerBenchmark.floatDotProductMemSeg           256     256  thrpt   15  20.335 ± 0.037  ops/us
```

Net/net it seems like the alignment of the mapped in-RAM data (in virtual address space) doesn't matter? I also tested a newer CPU (Raptor Lake) -- I'll post that shortly.
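
To put a rough number on "doesn't matter", here is a minimal sketch that computes, for each benchmark, how far the scores vary across the `padBytes` values. The scores are copied from the JMH table above; the names `SCORES` and `spread_pct` are made up for this illustration and are not part of the PR or of JMH. It ignores the reported error bars, so it is only a coarse check, not a significance test.

```python
# Scores (ops/us) copied from the JMH table above, in padBytes order:
# 0, 1, 2, 4, 6, 8, 16, 20, 32, 50, 64, 100, 128, 255, 256
SCORES = {
    "binaryDotProductDefault": [8.601, 8.573, 8.593, 8.579, 8.584, 8.605, 8.603, 8.583,
                                8.581, 8.591, 8.611, 8.594, 8.620, 8.597, 8.605],
    "binaryDotProductMemSeg": [25.203, 25.961, 25.314, 25.958, 25.295, 26.122, 26.056, 25.848,
                               25.817, 26.065, 26.045, 26.093, 26.101, 26.028, 26.027],
    "floatDotProductDefault": [15.241, 15.169, 15.230, 15.231, 15.229, 15.216, 15.278, 15.058,
                               15.192, 15.300, 15.257, 15.272, 15.144, 15.248, 15.276],
    "floatDotProductMemSeg": [20.360, 20.252, 20.281, 20.261, 20.285, 20.359, 20.344, 20.272,
                              20.413, 20.066, 20.386, 20.029, 20.348, 20.047, 20.335],
}


def spread_pct(scores):
    """Return (max - min) as a percentage of the mean score."""
    mean = sum(scores) / len(scores)
    return 100.0 * (max(scores) - min(scores)) / mean


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(f"{name:24s} min={min(scores):7.3f} max={max(scores):7.3f} "
              f"spread={spread_pct(scores):.2f}%")
```

For every benchmark the gap between the best and worst `padBytes` setting comes out to a few percent of the mean or less, and the lowest `binaryDotProductMemSeg` scores are the rows with the largest reported error.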
