adamreeve commented on PR #47998:
URL: https://github.com/apache/arrow/pull/47998#issuecomment-3465950738
I'm not sure this is the best approach for fixing this, because it slows down
the RLE encoding benchmarks on my machine, although it also seems to speed up
some benchmark cases:
```
----------------------------------------------------------------------------------------------------
Non-regressions: (9)
----------------------------------------------------------------------------------------------------
benchmark                                   baseline        contender       change %  counters
BM_RleEncoding/32768/1                      2.866 GiB/sec   3.402 GiB/sec     18.680  {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncoding/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 33348}
BM_RleEncoding/4096/1                       2.857 GiB/sec   3.363 GiB/sec     17.740  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncoding/4096/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 263675}
BM_RleEncoding/65536/1                      2.895 GiB/sec   3.381 GiB/sec     16.763  {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncoding/65536/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16491}
BM_RleEncoding/1024/1                       2.775 GiB/sec   3.114 GiB/sec     12.186  {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncoding/1024/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1047638}
BM_RleEncodingSpacedBoolean/32768/10000     62.305 GiB/sec  62.721 GiB/sec     0.668  {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1427195, 'null_percent': 100.0}
BM_RleEncodingBoolean/1024                  410.025 MiB/sec 406.502 MiB/sec   -0.859  {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncodingBoolean/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 294385}
BM_RleEncodingBoolean/4096                  426.461 MiB/sec 421.442 MiB/sec   -1.177  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncodingBoolean/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 76216}
BM_RleEncodingBoolean/32768                 435.241 MiB/sec 425.014 MiB/sec   -2.350  {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncodingBoolean/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9715}
BM_RleEncodingBoolean/65536                 435.624 MiB/sec 425.030 MiB/sec   -2.432  {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncodingBoolean/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4871}
----------------------------------------------------------------------------------------------------
Regressions: (12)
----------------------------------------------------------------------------------------------------
benchmark                                   baseline        contender       change %  counters
BM_RleEncoding/4096/16                      644.740 MiB/sec 604.668 MiB/sec   -6.215  {'family_index': 0, 'per_family_instance_index': 9, 'run_name': 'BM_RleEncoding/4096/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 57555}
BM_RleEncoding/65536/16                     646.133 MiB/sec 601.672 MiB/sec   -6.881  {'family_index': 0, 'per_family_instance_index': 11, 'run_name': 'BM_RleEncoding/65536/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3644}
BM_RleEncoding/32768/16                     649.870 MiB/sec 601.706 MiB/sec   -7.411  {'family_index': 0, 'per_family_instance_index': 10, 'run_name': 'BM_RleEncoding/32768/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7281}
BM_RleEncoding/1024/16                      650.748 MiB/sec 601.229 MiB/sec   -7.610  {'family_index': 0, 'per_family_instance_index': 8, 'run_name': 'BM_RleEncoding/1024/16', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 230323}
BM_RleEncodingSpacedBoolean/32768/5000      228.351 MiB/sec 207.755 MiB/sec   -9.020  {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/5000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5085, 'null_percent': 50.0}
BM_RleEncodingSpacedBoolean/32768/1000      172.863 MiB/sec 154.948 MiB/sec  -10.364  {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3833, 'null_percent': 10.0}
BM_RleEncodingSpacedBoolean/32768/1         175.361 MiB/sec 155.387 MiB/sec  -11.390  {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3953, 'null_percent': 0.01}
BM_RleEncodingSpacedBoolean/32768/100       173.656 MiB/sec 153.307 MiB/sec  -11.718  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BM_RleEncodingSpacedBoolean/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3910, 'null_percent': 1.0}
BM_RleEncoding/32768/8                      650.268 MiB/sec 573.097 MiB/sec  -11.868  {'family_index': 0, 'per_family_instance_index': 6, 'run_name': 'BM_RleEncoding/32768/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7267}
BM_RleEncoding/1024/8                       644.999 MiB/sec 566.453 MiB/sec  -12.178  {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BM_RleEncoding/1024/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 232475}
BM_RleEncoding/4096/8                       653.119 MiB/sec 570.380 MiB/sec  -12.668  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BM_RleEncoding/4096/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 56561}
BM_RleEncoding/65536/8                      651.448 MiB/sec 567.123 MiB/sec  -12.944  {'family_index': 0, 'per_family_instance_index': 7, 'run_name': 'BM_RleEncoding/65536/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3611}
```
I tried some alternative approaches that avoid the cast to int64 and operate
on an int number of bytes instead of bits, but those were about as slow and
could also still overflow if the maximum buffer size was close to int32 max.
Eg:
```C++
int new_bit_offset = bit_offset_ + num_bits;
if (ARROW_PREDICT_FALSE(
        byte_offset_ + (new_bit_offset == 0 ? 0 : 1 + (new_bit_offset - 1) / 8) >
        max_bytes_))
```
Another alternative would be to limit the maximum buffer size to something
like `int32_max / 8 - 8`, which I think would also prevent any overflow
without needing to change this if condition.
I also noticed that the return value of `BitWriter::PutValue` only appears to
be used in debug builds, so it could make sense to enable this check only in
debug builds too. But removing the check from release builds would mean that,
rather than silently failing to write some data, there could be invalid memory
writes. The encoders are also public, so they could be used outside this
codebase by consumers that do check that writes succeed, and removing the
check would be a breaking change for them.