[I] Spark 4.1: native_datafusion bytesRead task metric off by 6-14x vs Spark [datafusion-comet]

via GitHub Sun, 03 May 2026 10:35:16 -0700


andygrove opened a new issue, #4194:
URL: https://github.com/apache/datafusion-comet/issues/4194


   Sub-issue of #4098.
   
   ## Description
   
   Three tests fail in \`Spark 4.1, JDK 17/auto [exec]\` (\`CometTaskMetricsSuite\`):
   
   - \`native_datafusion scan reports task-level input metrics matching Spark\`
   - \`input metrics aggregate across multiple native scans in a join\`
   - \`input metrics aggregate across multiple native scans in a union\`
   
   Symptom (one example):
   
   \`\`\`
   9.6 was greater than or equal to 0.7, but 9.6 was not less than or equal to 1.3
   bytesRead ratio out of range: comet=90498, spark=9427, ratio=9.6
   \`\`\`
   
   The other two tests fail the same way, with ratios of 6.4 and 13.9.
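
The range check behind the failure message can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python, not the suite's actual code: the function names `bytes_read_ratio` and `within_tolerance` are hypothetical, and the 0.7 and 1.3 bounds are taken from the assertion message above.

```python
def bytes_read_ratio(comet_bytes: int, spark_bytes: int) -> float:
    """Return Comet's bytesRead divided by Spark's bytesRead."""
    if spark_bytes <= 0:
        raise ValueError("spark_bytes must be positive to form a ratio")
    return comet_bytes / spark_bytes


def within_tolerance(ratio: float, low: float = 0.7, high: float = 1.3) -> bool:
    """Check the ratio against the bounds quoted in the assertion message."""
    return low <= ratio <= high


# Values copied from the reported failure: comet=90498, spark=9427.
ratio = bytes_read_ratio(90498, 9427)
print(f"ratio={ratio:.1f}, in range: {within_tolerance(ratio)}")
# prints "ratio=9.6, in range: False"
```

With the reported values the ratio clears the lower bound but exceeds the upper bound by a wide margin, which matches the two-part wording of the assertion.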
   
   ## Suspected root cause
   
   Spark 4.1 appears to have changed what \`inputMetrics.bytesRead\` accounts for.
   It most likely now reports a smaller subset (e.g. only the bytes actually read
   into row buffers, rather than the full Parquet footer plus row group). Compare
   the \`ParquetFileReader\` / \`PartitionedFile\` accounting between 4.0 and 4.1
   and adjust Comet's metric source accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

