sdf-jkl commented on PR #9414: URL: https://github.com/apache/arrow-rs/pull/9414#issuecomment-4159269629
I also now have a rather different idea. Why reason about averages of densities and selectivities when we have the real numbers? When materializing a selection we already loop over the selectors and count the non-empty ones: https://github.com/apache/arrow-rs/blob/aa9432c8833f5701085e8b933b30560d21df9f80/parquet/src/arrow/arrow_reader/read_plan.rs#L113-L123

Why don't we also count the rows covered by long/short selectors instead? For example:

```rust
// total_rows:       total rows covered by the selection
// effective_count:  number of non-empty selector runs
// short_*_rows:     rows in runs with row_count <= short_threshold
// long_*_rows:      rows in runs with row_count >= long_threshold
let (
    total_rows,
    effective_count,
    short_select_rows,
    short_skip_rows,
    long_select_rows,
    long_skip_rows,
) = selection.iter().fold(
    (0usize, 0usize, 0usize, 0usize, 0usize, 0usize),
    |(rows, cnt, ss, sk, ls, lk), s| {
        // Empty runs contribute nothing.
        if s.row_count == 0 {
            return (rows, cnt, ss, sk, ls, lk);
        }
        let rows = rows + s.row_count;
        let cnt = cnt + 1;
        let is_short = s.row_count <= short_threshold;
        let is_long = s.row_count >= long_threshold;
        match (s.skip, is_short, is_long) {
            (true, true, _) => (rows, cnt, ss, sk + s.row_count, ls, lk),
            (true, _, true) => (rows, cnt, ss, sk, ls, lk + s.row_count),
            (false, true, _) => (rows, cnt, ss + s.row_count, sk, ls, lk),
            (false, _, true) => (rows, cnt, ss, sk, ls + s.row_count, lk),
            _ => (rows, cnt, ss, sk, ls, lk), // middle bin
        }
    },
);
```

Here `short_threshold`/`long_threshold` could be based on the page size, or something else. With this statistics/histogram we'd be able to make a more data-driven decision about keeping or deferring the selection.
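To illustrate what "data-driven" could look like, here is a minimal, self-contained sketch of a decision helper built on that histogram. The `RowSelector` stand-in struct, the `should_defer` name, and the "more than half of the selected rows sit in short runs" rule are all hypothetical, chosen only to show the shape of the idea, not actual arrow-rs code:

```rust
/// Minimal stand-in for parquet's `RowSelector`; only the two fields
/// the histogram needs. Purely illustrative.
#[derive(Clone, Copy)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

/// Hypothetical heuristic: defer the selection when more than half of the
/// selected rows sit in runs of at most `short_threshold` rows, on the
/// assumption that many tiny reads cost more than decode-then-filter.
fn should_defer(selectors: &[RowSelector], short_threshold: usize) -> bool {
    let (total_select, short_select) =
        selectors.iter().fold((0usize, 0usize), |(total, short), s| {
            if s.skip || s.row_count == 0 {
                // Skips and empty runs don't count toward selected rows.
                (total, short)
            } else if s.row_count <= short_threshold {
                (total + s.row_count, short + s.row_count)
            } else {
                (total + s.row_count, short)
            }
        });
    total_select > 0 && short_select * 2 > total_select
}

fn main() {
    // All selected rows are in short runs -> defer.
    let fragmented = vec![
        RowSelector { row_count: 3, skip: false },
        RowSelector { row_count: 100, skip: true },
        RowSelector { row_count: 2, skip: false },
    ];
    println!("{}", should_defer(&fragmented, 10)); // prints "true"

    // One long contiguous select run -> keep the selection.
    let contiguous = vec![RowSelector { row_count: 50, skip: false }];
    println!("{}", should_defer(&contiguous, 10)); // prints "false"
}
```

The exact rule (a 50% cutoff on short select rows) is a placeholder; the point is that once the long/short row counts are available, any such policy becomes a cheap arithmetic check rather than an estimate from averages.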
