expani opened a new pull request, #13521:
URL: https://github.com/apache/lucene/pull/13521

   ### Background
   
   Lucene uses 3 different ways of storing docIds in the KDD file of a BKD tree based index when the docIds in a leaf block are not sorted :
   
   - If the difference between the min and max docId in a block ( usually 512 docIds per block ) fits in 16 bits, it packs 2 docIds into a single integer ( BPV_16 )
   - If the max docId in a block fits in 24 bits, it packs 8 docIds into 3 longs ( BPV_24 )
   - Otherwise, it writes the docIds as plain 32-bit integers ( BPV_32 )
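
   As a rough illustration of BPV_16, here is a hedged sketch ( class and method names are mine, modeled on the description above rather than copied from Lucene's actual `DocIdsWriter` code ) of one plausible reading: store the block min once, then pack two 16-bit min-relative deltas per int :
   
   ```java
   public class Bpv16Sketch {
     // Pack two docIds into one int as 16-bit deltas from the block minimum.
     // Assumes (max - min) fits in 16 bits, the BPV_16 precondition.
     static int pack(int min, int d0, int d1) {
       return ((d0 - min) << 16) | (d1 - min);
     }
   
     // Unpack the two docIds back out of the int.
     static int[] unpack(int min, int packed) {
       return new int[] {min + (packed >>> 16), min + (packed & 0xFFFF)};
     }
   
     public static void main(String[] args) {
       int min = 1_000_000;
       int packed = pack(min, 1_000_500, 1_065_535);
       int[] back = unpack(min, packed);
       System.out.println(back[0] + " " + back[1]); // 1000500 1065535
     }
   }
   ```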
    
   
   BPV_24 uses fewer bitwise/arithmetic operations than BPV_16. Even if we represent the docIds using any other number of bits like 21/22/23/25/26/27/28/29/30/31, the number of bitwise operations used for encoding and decoding will increase. Let's compare 2 encoding schemes as an example.
   
   Note : `|` represents one packed long and `,` separates the individual docIds within a packed long.
   Only decoding is shown in the examples, but the same applies to encoding as well.
   
   For BPV_24
   ```
   | 24,24,16 | 8,24,24,8 | 16,24,24 | 
   ```
   Packs 8 docIds in 3 longs. So, 512 docIds need ( 3/8 * 512 ) = 192 longs.
   It uses 18 bitwise operations to decode 8 docIds. So, 512 docIds require ( 18/8 * 512 ) = 1152 bitwise operations.
   The bitwise operations used for decoding 8 docIds can also be visualised as follows :
   ```
   S, MS, MSSO, MS, MS, MSSO, MS, M
   ```
   `M` is a mask using AND, `S` is a bitwise left/right shift, and `O` is a bitwise OR joining 2 partial halves present in different longs.
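
   The 18 operations above can be counted directly in a Java sketch of this layout ( my own simplified model of the BPV_24 scheme, not the actual Lucene code; assumes each docId fits in 24 bits ) :
   
   ```java
   public class Bpv24Sketch {
     // Encode 8 24-bit docIds into 3 longs as | 24,24,16 | 8,24,24,8 | 16,24,24 |
     static void encode(int[] d, long[] out) {
       out[0] = ((long) d[0] << 40) | ((long) d[1] << 16) | (d[2] >>> 8);
       out[1] = ((long) d[2] << 56) | ((long) d[3] << 32) | ((long) d[4] << 8) | (d[5] >>> 16);
       out[2] = ((long) d[5] << 48) | ((long) d[6] << 24) | d[7];
     }
   
     // Decode with the op pattern S, MS, MSSO, MS, MS, MSSO, MS, M = 18 ops total
     static void decode(long[] l, int[] d) {
       d[0] = (int) (l[0] >>> 40);                            // S
       d[1] = (int) (l[0] >>> 16) & 0xFFFFFF;                 // MS
       d[2] = (int) (((l[0] & 0xFFFF) << 8) | (l[1] >>> 56)); // MSSO
       d[3] = (int) (l[1] >>> 32) & 0xFFFFFF;                 // MS
       d[4] = (int) (l[1] >>> 8) & 0xFFFFFF;                  // MS
       d[5] = (int) (((l[1] & 0xFF) << 16) | (l[2] >>> 48));  // MSSO
       d[6] = (int) (l[2] >>> 24) & 0xFFFFFF;                 // MS
       d[7] = (int) l[2] & 0xFFFFFF;                          // M
     }
   
     public static void main(String[] args) {
       int[] d = {0, 1, 0xFFFFFF, 123456, 0xABCDEF, 42, 7, 0x123456};
       long[] packed = new long[3];
       encode(d, packed);
       int[] back = new int[8];
       decode(packed, back);
       System.out.println(java.util.Arrays.equals(d, back)); // true
     }
   }
   ```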
   
   For BPV_20
   ```
   | 20,20,20,4 | 16,20,20,8 | 12,20,20,12 | 8,20,20,16 | 4,20,20,20 | 
   ```
   Packs 16 docIds in 5 longs. So, 512 docIds require ( 5/16 * 512 ) = 160 longs.
   However, it uses 38 bitwise operations to decode 16 docIds. So, 512 docIds require ( 38/16 * 512 ) = 1216 bitwise operations.
   ```
   S, MS, MS, MSSO, MS, MS, MSSO, MS, MS, MSSO, MS, MS, MSSO, MS, MS, M
   ```
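
   To make the 38-operation count concrete, here is a hedged sketch of this hypothetical BPV_20 layout ( written purely for analysis; this scheme does not exist in Lucene, and all names are mine ) :
   
   ```java
   public class Bpv20Sketch {
     // Hypothetical layout: 16 20-bit docIds in 5 longs,
     // | 20,20,20,4 | 16,20,20,8 | 12,20,20,12 | 8,20,20,16 | 4,20,20,20 |
     static void encode(int[] d, long[] o) {
       o[0] = ((long) d[0] << 44) | ((long) d[1] << 24) | ((long) d[2] << 4) | (d[3] >>> 16);
       o[1] = ((long) d[3] << 48) | ((long) d[4] << 28) | ((long) d[5] << 8) | (d[6] >>> 12);
       o[2] = ((long) d[6] << 52) | ((long) d[7] << 32) | ((long) d[8] << 12) | (d[9] >>> 8);
       o[3] = ((long) d[9] << 56) | ((long) d[10] << 36) | ((long) d[11] << 16) | (d[12] >>> 4);
       o[4] = ((long) d[12] << 60) | ((long) d[13] << 40) | ((long) d[14] << 20) | d[15];
     }
   
     // Decode: S, then (MS, MS, MSSO) repeated, ending MS, MS, M = 38 bitwise ops
     static void decode(long[] l, int[] d) {
       d[0] = (int) (l[0] >>> 44);
       d[1] = (int) (l[0] >>> 24) & 0xFFFFF;
       d[2] = (int) (l[0] >>> 4) & 0xFFFFF;
       d[3] = (int) (((l[0] & 0xF) << 16) | (l[1] >>> 48));
       d[4] = (int) (l[1] >>> 28) & 0xFFFFF;
       d[5] = (int) (l[1] >>> 8) & 0xFFFFF;
       d[6] = (int) (((l[1] & 0xFF) << 12) | (l[2] >>> 52));
       d[7] = (int) (l[2] >>> 32) & 0xFFFFF;
       d[8] = (int) (l[2] >>> 12) & 0xFFFFF;
       d[9] = (int) (((l[2] & 0xFFF) << 8) | (l[3] >>> 56));
       d[10] = (int) (l[3] >>> 36) & 0xFFFFF;
       d[11] = (int) (l[3] >>> 16) & 0xFFFFF;
       d[12] = (int) (((l[3] & 0xFFFF) << 4) | (l[4] >>> 60));
       d[13] = (int) (l[4] >>> 40) & 0xFFFFF;
       d[14] = (int) (l[4] >>> 20) & 0xFFFFF;
       d[15] = (int) l[4] & 0xFFFFF;
     }
   
     public static void main(String[] args) {
       int[] d = new int[16];
       for (int i = 0; i < 16; i++) d[i] = (i * 65423 + 17) & 0xFFFFF;
       long[] packed = new long[5];
       encode(d, packed);
       int[] back = new int[16];
       decode(packed, back);
       System.out.println(java.util.Arrays.equals(d, back)); // true
     }
   }
   ```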
   
   I have analysed the other BPVs like `21/22/23/25/26/27/28/29/30/31` in the same way, and in all cases the number of bitwise operations for encoding and decoding is higher than for BPV_24.
   
   ### Solution
   
   While analysing BPV_21, I observed that if we pack just 3 docIds in a long, the number of bitwise operations for encoding and decoding drops below that of BPV_24. The extra bit can be kept at the leftmost position ( MSB ) as `0` to reduce the number of operations.
   
   ```
   | 1,21,21,21 | 1,21,21,21 | 1,21,21,21 | 
   ```
   Decoding
   ```
   S, MS, M, S, MS, M, S, MS, M
   ```
   In this case, it requires 12 bitwise operations to decode 9 docIds. So, 512 docIds will require ( 12/9 * 512 ) ~ 683 bitwise operations.
   
   It stores 9 docIds in 3 packed longs. So, 512 docIds require ( 3/9 * 512 ) ~ 171 longs. This reduces the number of longs required for such leaves by 21 ( 192 - 171 ) compared to BPV_24.
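
   A minimal sketch of the proposed packing ( hypothetical names; the PR's actual encoders are `Bit21With2StepsAddEncoder` / `Bit21With3StepsAddEncoder` ) :
   
   ```java
   public class Bpv21Sketch {
     static final int MASK_21 = 0x1FFFFF; // 21 set bits, max value 2,097,151
   
     // Pack 3 21-bit docIds into one long; the leftmost bit stays 0
     static long encode(int a, int b, int c) {
       return ((long) a << 42) | ((long) b << 21) | c;
     }
   
     // Decode with just 4 ops per long: S, MS, M
     static void decode(long l, int[] d, int off) {
       d[off] = (int) (l >>> 42);               // S
       d[off + 1] = (int) (l >>> 21) & MASK_21; // MS
       d[off + 2] = (int) l & MASK_21;          // M
     }
   
     public static void main(String[] args) {
       int[] d = new int[3];
       decode(encode(0x1FFFFF, 0, 123456), d, 0);
       System.out.println(d[0] + " " + d[1] + " " + d[2]); // 2097151 0 123456
     }
   }
   ```
   
   Three such longs carry 9 docIds using 12 decode operations, versus 18 operations for 8 docIds with BPV_24.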
   
   ### Micro Benchmark 
   
   Since introducing BPV_21 makes it compete with BPV_24, I wrote a micro benchmark to compare the encoding and decoding speeds of both variations.
   
   `Bit21With2StepsAddEncoder` and `Bit21With3StepsAddEncoder` both perform encode/decode using the proposed BPV_21 format.
   `Bit24Encoder` is an exact replica of the BPV_24 logic used in Lucene today.
   
   Java version used
   ```
   openjdk version "22.0.1" 2024-04-16
   OpenJDK Runtime Environment Corretto-22.0.1.8.1 (build 22.0.1+8-FR)
   OpenJDK 64-Bit Server VM Corretto-22.0.1.8.1 (build 22.0.1+8-FR, mixed mode, 
sharing)
   ```
   
   Input to the benchmark : 
   ```
           private static final String USAGE = "\n USAGE " +
                   "\n <1> Input Data File Path " +
                   "\n<2> Output Directory Path " +
                   "\n<3> Number of Iterations " +
                   "\n<4> Encoder name \n" +
                   "\n<5> Input Scale Factor";
   ```
   
   Sample run command
   ```
   > nohup.out && nohup java -Xms6G -Xmx6G -cp 
<PathToJar>/lucene-core-10.0.0-SNAPSHOT.jar 
org.apache.lucene.util.bkd.docIds.DocIdWriterBenchmark 
<PathToInput>/finalfile_512.txt <OutDir> Bit24Encoder 10 1000 & 
   ```
   
   The data used in the benchmark is `lucene/core/src/java/org/apache/lucene/util/bkd/docIds/data/finalfile.txt`, which contains all docId sequences that can be represented in 21 bits ( max is <= 0x001FFFFFL aka 2,097,151 ). It was extracted from the first 10 million docs of the [NYC Taxi data](https://dbyiw3u3rf9yr.cloudfront.net/corpora/nyc_taxis) by only indexing the field `fare_amount` as a double point.
   
   There are 6509 docId sequences in the input file and 6493 of them contain 512 docIds each. There are a total of 3,328,103 docIds in those 6509 sequences.
   `Input scale factor` multiplies the number of docId sequences by the given factor to increase the load for the benchmark.
   
   The script below was executed 5 times and the numbers are the average of 
those runs. 
   `10` is the number of iterations for the encoder and `1000` is the input 
scale factor. 
   ```
   for i in `seq 1 10`
   do
   
    echo -e "\nRunning benchmark for Bit24Encoder at "`date +'%F %T'`"\n"
    java -Xms6G -Xmx6G -cp ./lucene-core-10.0.0-SNAPSHOT.jar 
org.apache.lucene.util.bkd.docIds.DocIdWriterBenchmark ./finalfile.txt ./Out/ 
Bit24Encoder 10 1000
   
    echo -e "\nRunning benchmark for Bit21With2StepsAddEncoder at "`date +'%F 
%T'`"\n"
    java -Xms6G -Xmx6G -cp ./lucene-core-10.0.0-SNAPSHOT.jar 
org.apache.lucene.util.bkd.docIds.DocIdWriterBenchmark ./finalfile.txt ./Out/ 
Bit21With2StepsAddEncoder 10 1000
   
    echo -e "\nRunning benchmark for Bit21With3StepsAddEncoder at "`date +'%F 
%T'`"\n"
    java -Xms6G -Xmx6G -cp ./lucene-core-10.0.0-SNAPSHOT.jar 
org.apache.lucene.util.bkd.docIds.DocIdWriterBenchmark ./finalfile.txt ./Out/ 
Bit21With3StepsAddEncoder 10 1000
   
   done
   ```
   
   Write latency numbers exhibited 2 patterns : one when EBS write latency peaked and another during normal EBS latency. Both variations are captured below.
   
   Architecture | Instance Type | Encoder | Decode Latency | Encode Low Write 
Latency | Encode High Write Latency
   -- | -- | -- | -- | -- | --
   x86_64 | m5.xlarge | Bit24Encoder | 3303 ms | 13023 ms | 50599 ms
   x86_64 | m5.xlarge | Bit21With2StepsAddEncoder | 3037 ms | 11680 ms | 42576 
ms
   x86_64 | m5.xlarge | Bit21With3StepsAddEncoder | 3081 ms | 11101 ms | 42610 
ms
   aarch64 | r6g.large | Bit24Encoder | 3454 ms | 16208 ms | 78946 ms
   aarch64 | r6g.large | Bit21With2StepsAddEncoder | 2954 ms | 14968 ms | 65165 
ms
   aarch64 | r6g.large | Bit21With3StepsAddEncoder | 2792 ms | 15777 ms | 65177 
ms
   
   
   There was also a size reduction of around 200 MB in the kdd file when indexing the entire NYC Taxi dataset with this change.
   
   ### Next steps 
   
   I need inputs from the maintainers and contributors on this new BPV format, and on the other benchmarks that should be executed ( probably luceneutil ? ) to confirm that this change doesn't cause regressions like those [seen in the SIMD optimisation](https://github.com/apache/lucene/pull/706).
   
   After feedback, I will fix the build failures, add UTs, and remove the `org.apache.lucene.util.bkd.docIds` package so that this is in a state to be merged.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

