Re: [PR] Align vectors on disk for optimal vectorized performance? [lucene]

via GitHub Mon, 20 Oct 2025 08:53:56 -0700


mikemccand commented on PR #15341:
URL: https://github.com/apache/lucene/pull/15341#issuecomment-3422688417


   I tested on [`beast3` (nightly benchmarking box)](https://blog.mikemccandless.com/2021/01/apache-lucene-performance-on-128-core.html) -- a Ryzen Threadripper 3990X:
   
   ```
   processor       : 127
   vendor_id       : AuthenticAMD
   cpu family      : 23
   model           : 49
   model name      : AMD Ryzen Threadripper 3990X 64-Core Processor
   stepping        : 0
   microcode       : 0x830107c
   cpu MHz         : 2900.000
   cache size      : 512 KB
   physical id     : 0
   siblings        : 128
   core id         : 63
   cpu cores       : 64
   apicid          : 127
   initial apicid  : 127
   fpu             : yes
   fpu_exception   : yes
   cpuid level     : 16
   wp              : yes
   flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
   bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret spectre_v2_user
   bogomips        : 5788.93
   TLB size        : 3072 4K pages
   clflush size    : 64
   cache_alignment : 64
   address sizes   : 43 bits physical, 48 bits virtual
   power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
   ```
   
   I applied this PR, built (`./gradlew :lucene:benchmark-jmh:assemble`), and ran `java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh VectorScorerBenchmark -p size=256`, and got:
   
   ```
   Benchmark                                      (padBytes)  (size)   Mode  Cnt   Score   Error   Units
   VectorScorerBenchmark.binaryDotProductDefault           0     256  thrpt   15   8.601 ± 0.047  ops/us
   VectorScorerBenchmark.binaryDotProductDefault           1     256  thrpt   15   8.573 ± 0.058  ops/us
   VectorScorerBenchmark.binaryDotProductDefault           2     256  thrpt   15   8.593 ± 0.023  ops/us
   VectorScorerBenchmark.binaryDotProductDefault           4     256  thrpt   15   8.579 ± 0.026  ops/us
   VectorScorerBenchmark.binaryDotProductDefault           6     256  thrpt   15   8.584 ± 0.026  ops/us
   VectorScorerBenchmark.binaryDotProductDefault           8     256  thrpt   15   8.605 ± 0.019  ops/us
   VectorScorerBenchmark.binaryDotProductDefault          16     256  thrpt   15   8.603 ± 0.034  ops/us
   VectorScorerBenchmark.binaryDotProductDefault          20     256  thrpt   15   8.583 ± 0.031  ops/us
   VectorScorerBenchmark.binaryDotProductDefault          32     256  thrpt   15   8.581 ± 0.030  ops/us
   VectorScorerBenchmark.binaryDotProductDefault          50     256  thrpt   15   8.591 ± 0.052  ops/us
   VectorScorerBenchmark.binaryDotProductDefault          64     256  thrpt   15   8.611 ± 0.033  ops/us
   VectorScorerBenchmark.binaryDotProductDefault         100     256  thrpt   15   8.594 ± 0.051  ops/us
   VectorScorerBenchmark.binaryDotProductDefault         128     256  thrpt   15   8.620 ± 0.032  ops/us
   VectorScorerBenchmark.binaryDotProductDefault         255     256  thrpt   15   8.597 ± 0.026  ops/us
   VectorScorerBenchmark.binaryDotProductDefault         256     256  thrpt   15   8.605 ± 0.056  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg            0     256  thrpt   15  25.203 ± 1.850  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg            1     256  thrpt   15  25.961 ± 0.047  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg            2     256  thrpt   15  25.314 ± 1.959  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg            4     256  thrpt   15  25.958 ± 0.067  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg            6     256  thrpt   15  25.295 ± 1.977  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg            8     256  thrpt   15  26.122 ± 0.073  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg           16     256  thrpt   15  26.056 ± 0.184  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg           20     256  thrpt   15  25.848 ± 1.589  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg           32     256  thrpt   15  25.817 ± 0.417  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg           50     256  thrpt   15  26.065 ± 0.585  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg           64     256  thrpt   15  26.045 ± 0.162  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg          100     256  thrpt   15  26.093 ± 0.061  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg          128     256  thrpt   15  26.101 ± 0.090  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg          255     256  thrpt   15  26.028 ± 0.088  ops/us
   VectorScorerBenchmark.binaryDotProductMemSeg          256     256  thrpt   15  26.027 ± 0.301  ops/us
   VectorScorerBenchmark.floatDotProductDefault            0     256  thrpt   15  15.241 ± 0.010  ops/us
   VectorScorerBenchmark.floatDotProductDefault            1     256  thrpt   15  15.169 ± 0.232  ops/us
   VectorScorerBenchmark.floatDotProductDefault            2     256  thrpt   15  15.230 ± 0.082  ops/us
   VectorScorerBenchmark.floatDotProductDefault            4     256  thrpt   15  15.231 ± 0.034  ops/us
   VectorScorerBenchmark.floatDotProductDefault            6     256  thrpt   15  15.229 ± 0.048  ops/us
   VectorScorerBenchmark.floatDotProductDefault            8     256  thrpt   15  15.216 ± 0.091  ops/us
   VectorScorerBenchmark.floatDotProductDefault           16     256  thrpt   15  15.278 ± 0.048  ops/us
   VectorScorerBenchmark.floatDotProductDefault           20     256  thrpt   15  15.058 ± 0.711  ops/us
   VectorScorerBenchmark.floatDotProductDefault           32     256  thrpt   15  15.192 ± 0.100  ops/us
   VectorScorerBenchmark.floatDotProductDefault           50     256  thrpt   15  15.300 ± 0.047  ops/us
   VectorScorerBenchmark.floatDotProductDefault           64     256  thrpt   15  15.257 ± 0.083  ops/us
   VectorScorerBenchmark.floatDotProductDefault          100     256  thrpt   15  15.272 ± 0.038  ops/us
   VectorScorerBenchmark.floatDotProductDefault          128     256  thrpt   15  15.144 ± 0.529  ops/us
   VectorScorerBenchmark.floatDotProductDefault          255     256  thrpt   15  15.248 ± 0.024  ops/us
   VectorScorerBenchmark.floatDotProductDefault          256     256  thrpt   15  15.276 ± 0.039  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg             0     256  thrpt   15  20.360 ± 0.077  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg             1     256  thrpt   15  20.252 ± 0.177  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg             2     256  thrpt   15  20.281 ± 0.060  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg             4     256  thrpt   15  20.261 ± 0.048  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg             6     256  thrpt   15  20.285 ± 0.063  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg             8     256  thrpt   15  20.359 ± 0.072  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg            16     256  thrpt   15  20.344 ± 0.078  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg            20     256  thrpt   15  20.272 ± 0.090  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg            32     256  thrpt   15  20.413 ± 0.010  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg            50     256  thrpt   15  20.066 ± 0.051  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg            64     256  thrpt   15  20.386 ± 0.051  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg           100     256  thrpt   15  20.029 ± 0.095  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg           128     256  thrpt   15  20.348 ± 0.049  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg           255     256  thrpt   15  20.047 ± 0.101  ops/us
   VectorScorerBenchmark.floatDotProductMemSeg           256     256  thrpt   15  20.335 ± 0.037  ops/us
   ```
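
For readers mapping the `padBytes` column to alignment, here is a minimal sketch that classifies each tested value. It assumes (the comment above does not say so) that `padBytes` shifts the start of the vector data by that many bytes from an otherwise aligned base. The 32-byte boundary corresponds to one 256-bit AVX2 register (the flags above list `avx2`), and 64 bytes is the `cache_alignment` reported above. `PadAlignment` and `isAligned` are hypothetical names, not part of Lucene.

```java
final class PadAlignment {
    // True if the byte offset falls exactly on a multiple of the boundary.
    static boolean isAligned(long offset, int boundary) {
        return offset % boundary == 0;
    }

    public static void main(String[] args) {
        // The padBytes values that appear in the benchmark table.
        int[] padBytes = {0, 1, 2, 4, 6, 8, 16, 20, 32, 50, 64, 100, 128, 255, 256};
        for (int pad : padBytes) {
            // ASSUMPTION: the base address is aligned, so pad alone decides alignment.
            System.out.printf("padBytes=%3d  32-byte aligned: %-5b  64-byte aligned: %b%n",
                pad, isAligned(pad, 32), isAligned(pad, 64));
        }
    }
}
```

Under that assumption, only 0, 32, 64, 128, and 256 keep the data 32-byte aligned, so most rows in the table exercise misaligned starts.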
   
   Net/net, it seems like the alignment of the mapped vectors in RAM (i.e. in virtual address space) doesn't matter on this CPU?
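
One way to put a number on that impression is to compute, for each benchmark, how far the best and worst `padBytes` scores are apart. The sketch below does this for the two `MemSeg` series, with scores copied from the table above. The class name `PadBytesSpread` and the `(max - min) / min` metric are illustrative choices, not something the benchmark itself reports.

```java
import java.util.Arrays;

final class PadBytesSpread {
    // Relative spread (max - min) / min across a set of throughput scores.
    static double relativeSpread(double[] scores) {
        double min = Arrays.stream(scores).min().orElseThrow();
        double max = Arrays.stream(scores).max().orElseThrow();
        return (max - min) / min;
    }

    public static void main(String[] args) {
        // Scores in ops/us, in table order (padBytes = 0, 1, 2, 4, 6, 8, 16, 20,
        // 32, 50, 64, 100, 128, 255, 256), copied from the results above.
        double[] binaryMemSeg = {
            25.203, 25.961, 25.314, 25.958, 25.295, 26.122, 26.056, 25.848,
            25.817, 26.065, 26.045, 26.093, 26.101, 26.028, 26.027};
        double[] floatMemSeg = {
            20.360, 20.252, 20.281, 20.261, 20.285, 20.359, 20.344, 20.272,
            20.413, 20.066, 20.386, 20.029, 20.348, 20.047, 20.335};

        System.out.printf("binaryDotProductMemSeg spread: %.2f%%%n",
            100 * relativeSpread(binaryMemSeg));
        System.out.printf("floatDotProductMemSeg spread:  %.2f%%%n",
            100 * relativeSpread(floatMemSeg));
    }
}
```

The spread should be read alongside the Error column: several `binaryDotProductMemSeg` rows carry an error above ±1.5 ops/us, which is larger than the gap between the best and worst scores in that series.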
   
   I also tested a newer CPU (Raptor Lake); I'll post those results shortly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

