alamb commented on PR #9574:
URL: https://github.com/apache/arrow-rs/pull/9574#issuecomment-4172905627

   BTW, Codex says it found issues with this PR; I haven't had time to review them carefully, but I would like to before we merge. I hope to find time for this PR over the next few days.
   
   <details><summary>Details</summary>
   <p>
   
   • Findings
   
     - High: LargeUtf8 filtering is broken, and LargeBinary is very likely broken for the same reason. In parquet/src/arrow/arrow_reader/dictionary_pruning.rs:177 the code handles both Utf8 and LargeUtf8 by building a StringArray, and in parquet/src/arrow/arrow_reader/dictionary_pruning.rs:192 the fallback always builds a BinaryArray. But the synthetic schema uses the real Arrow type at parquet/src/arrow/arrow_reader/dictionary_pruning.rs:123, so RecordBatch::try_new fails for LargeUtf8/LargeBinary. I reproduced this in a disposable worktree with an async row-filter test; it panicked with: expected LargeUtf8 but found Utf8 at column index 0.
     - High: the fallback “all pages are dictionary encoded” check is not safe and can return wrong query results on files without page encoding stats. The code explicitly falls back to a heuristic in parquet/src/arrow/arrow_reader/dictionary_pruning.rs:234 and treats Encoding::PLAIN as acceptable at parquet/src/arrow/arrow_reader/dictionary_pruning.rs:253. But PLAIN is exactly the ambiguous case for dictionary fallback pages: the metadata docs in parquet/src/file/metadata/mod.rs:1086 say page encoding stats were added specifically to answer this question safely. If a column chunk contains both dictionary pages and plain fallback data pages and lacks stats, this code can incorrectly prune rows.
     - Medium: the new logic is not protected by meaningful tests. I reverted the PR’s implementation files in a temporary worktree and reran the existing decoder-focused tests; they still passed. That means the current tests do not fail when dictionary pruning is removed. Coverage confirms this: target/llvm-cov/html/coverage/Users/andrewlamb/Software/arrow-rs/parquet/src/arrow/arrow_reader/dictionary_pruning.rs.html is only 34.06% line-covered, with the BYTE_ARRAY path (dictionary_pruning.rs.html:142), the full page-stats path (dictionary_pruning.rs.html:217), the heuristic fallback (dictionary_pruning.rs.html:240), and all INT32 conversion logic (dictionary_pruning.rs.html:269) uncovered.
   
     Public API: the PR adds ArrowPredicate::evaluate_dictionary and exports DictionaryPredicateResult from parquet/src/arrow/arrow_reader/filter.rs:25. That is a public API expansion, but it does not look breaking because the new trait method has a default implementation.
   
     Verification
   
      I ran cargo llvm-cov --html test -p parquet; the full parquet suite passed and the report is under target/llvm-cov/html/index.html. I also did an ablative run in a disposable worktree by reverting the PR’s implementation files and running cargo test -p parquet test_decoder_ -- --nocapture; all 13 decoder tests still passed. Finally, I added a temporary repro test in another disposable worktree and confirmed the LargeUtf8 failure above.
   
   
   </p>
   </details> 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
