andygrove opened a new pull request, #4197: URL: https://github.com/apache/datafusion-comet/pull/4197
## Which issue does this PR close? Closes #4194 (sub-issue of #4098). ## Rationale for this change Three tests in \`CometTaskMetricsSuite\` were ignored on Spark 4.1+ because Comet's reported \`bytesRead\` was 6-14× larger than Spark's, breaking the existing 0.7-1.3 ratio assertion. Investigation: Spark 4.1 changed \`ParquetFileFormat\` to pre-open the \`SeekableInputStream\` and read the file footer outside the \`FileScanRDD.compute()\` thread. Spark's \`inputMetrics.bytesRead\` is updated from a Hadoop FileSystem thread-local byte counter (\`SparkHadoopUtil.getFSBytesReadOnThreadCallback\`) that only captures reads on the \`compute()\` thread — so reads serviced by the pre-opened stream's internal buffer go uncounted. | File size | Spark 4.1 \`bytesRead\` vs actual | Comet/Spark ratio | |---|---|---| | Tiny (whole file pre-buffered) | Near 0 | 10-15× | | Small (MB) | Significant under-count | 2-5× | | Medium / Large (10+ MB) | Subsequent row-group reads counted | Closer to 1× | Comet (via DataFusion's \`bytes_scanned\`) reports actual file IO, which is the only truthful number to compare against. The existing tight ratio is unrecoverable on Spark 4.1+ for small files without changing what Comet reports — and changing Comet to under-report would be wrong. ## What changes are included in this PR? - \`CometTaskMetricsSuite\`: extracted a \`assertCometBytesReadInRange\` helper and used it from all three affected tests. On Spark <= 4.0 it keeps the tight 0.7-1.3 ratio. On Spark 4.1+ it asserts only \`cometBytes >= sparkBytes\` and that both are positive. Records read is still checked exactly. - Removed the \`assume(!isSpark41Plus, ...)\` guards from the three previously-skipped tests. - \`docs/source/user-guide/latest/metrics.md\`: added a section explaining the discrepancy, the file-size dependence, and that this is purely observability — \`inputMetrics.bytesRead\` is not consumed by the planner, the optimizer, or AQE, so plan choices and correctness are unaffected. ## How are these changes tested? - Full \`CometTaskMetricsSuite\` (5 tests) passes on **Spark 4.1** locally (was 3 failures + 2 passes before). - Full \`CometTaskMetricsSuite\` (5 tests) passes on **Spark 4.0** locally (V1 path unchanged). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
