sdf-jkl commented on PR #9414:
URL: https://github.com/apache/arrow-rs/pull/9414#issuecomment-4159269629

   I also now have a very different idea. 
   
   Basically, why work with averages of densities and selections when we have the real numbers?
   
   For selection materialization we already loop through the selectors and count the non-empty ones:
   
https://github.com/apache/arrow-rs/blob/aa9432c8833f5701085e8b933b30560d21df9f80/parquet/src/arrow/arrow_reader/read_plan.rs#L113-L123
   
   Why don't we count rows covered by long/short selectors instead?
   ```rust
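   // total_rows: total rows covered by the selection (selected and skipped)
   // effective_count: number of non-empty selector runs
   // short_*_rows: rows in select/skip runs with row_count <= short_threshold
   // long_*_rows: rows in select/skip runs with row_count >= long_threshold
   // Runs strictly between the two thresholds only count toward the totals.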
   let (
       total_rows,
       effective_count,
       short_select_rows,
       short_skip_rows,
       long_select_rows,
       long_skip_rows,
   ) = selection.iter().fold(
       (0usize, 0usize, 0usize, 0usize, 0usize, 0usize),
       |(rows, cnt, ss, sk, ls, lk), s| {
           if s.row_count == 0 {
               return (rows, cnt, ss, sk, ls, lk);
           }
   
           let rows = rows + s.row_count;
           let cnt = cnt + 1;
           let is_short = s.row_count <= short_threshold;
           let is_long = s.row_count >= long_threshold;
   
           match (s.skip, is_short, is_long) {
               (true, true, _) => (rows, cnt, ss, sk + s.row_count, ls, lk),
               (true, _, true) => (rows, cnt, ss, sk, ls, lk + s.row_count),
               (false, true, _) => (rows, cnt, ss + s.row_count, sk, ls, lk),
               (false, _, true) => (rows, cnt, ss, sk, ls + s.row_count, lk),
               _ => (rows, cnt, ss, sk, ls, lk), // middle bin
           }
       },
   );
   
   ```
   Here `short_threshold` and `long_threshold` could be based on the page size, or maybe on something else?
   
   With these statistics (effectively a small histogram) we'd be able to make a more data-driven decision on keeping or deferring the selection.
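
For anyone who wants to try the idea outside the reader, here is a minimal self-contained sketch of the same counting. It is not code from the PR: `RowSelector` below is a local stand-in with the same two fields the fold above uses (`row_count`, `skip`), and `SelectionHistogram` and `selection_histogram` are hypothetical names introduced only for illustration. As in the fold, a run that satisfies both thresholds is counted as short.

```rust
/// Local stand-in for the selector type used in the fold above
/// (assumption: only `row_count` and `skip` matter here).
#[derive(Debug, Clone, Copy)]
pub struct RowSelector {
    pub row_count: usize,
    pub skip: bool,
}

/// Hypothetical summary type; a named struct instead of a 6-tuple.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct SelectionHistogram {
    pub total_rows: usize,
    pub effective_count: usize,
    pub short_select_rows: usize,
    pub short_skip_rows: usize,
    pub long_select_rows: usize,
    pub long_skip_rows: usize,
}

/// Counts rows covered by short and long runs, split by select/skip.
/// Empty selectors are ignored; runs strictly between the thresholds
/// only contribute to `total_rows` and `effective_count`.
pub fn selection_histogram(
    selectors: &[RowSelector],
    short_threshold: usize,
    long_threshold: usize,
) -> SelectionHistogram {
    let mut h = SelectionHistogram::default();
    for s in selectors.iter().filter(|s| s.row_count > 0) {
        h.total_rows += s.row_count;
        h.effective_count += 1;

        // Short is checked first, matching the match-arm order in the fold.
        let bucket = if s.row_count <= short_threshold {
            if s.skip { &mut h.short_skip_rows } else { &mut h.short_select_rows }
        } else if s.row_count >= long_threshold {
            if s.skip { &mut h.long_skip_rows } else { &mut h.long_select_rows }
        } else {
            continue; // middle bin
        };
        *bucket += s.row_count;
    }
    h
}
```

Calling `selection_histogram(&selectors, short_threshold, long_threshold)` once per selection would give the six counters in a single pass; how they feed into the keep/defer decision is still the open question.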


