saitharun15 commented on code in PR #11615:
URL: https://github.com/apache/iceberg/pull/11615#discussion_r1854296151
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -194,10 +205,40 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     Map<NamedReference, ColumnStatistics> colStatsMap = Collections.emptyMap();
     if (readConf.reportColumnStats() && cboEnabled) {
       colStatsMap = Maps.newHashMap();
+      Map<Integer, Long> ndvs = Maps.newHashMap();
+      Map<Integer, Long> nullCounts = Maps.newHashMap();
+      Map<Integer, Object> minValues = Maps.newHashMap();
+      Map<Integer, Object> maxValues = Maps.newHashMap();
       List<StatisticsFile> files = table.statisticsFiles();
       if (!files.isEmpty()) {
         List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+        if (readConf.deriveStatsFromManifestSessionConf()
+            || readConf.deriveStatsFromManifestTableProperty()) {
+          Map<String, Map<Integer, Long>> distinctDataFilesNullCount = Maps.newHashMap();

Review Comment:
   We found that different FileScanTask objects in the taskGroup were pointing to the same data file, which caused duplicate entries. I used putIfAbsent to ensure a single entry per file and renamed the maps to distinctDataFiles. However, after switching to collecting the FileScanTask objects as a set, putIfAbsent is no longer needed. @RussellSpitzer, can you suggest an alternate approach?
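
For illustration only, the per-file deduplication described in the comment above might be sketched as follows. This is not the code from the PR: `DataFileStub`, `TaskStub`, `nullCountsByFile`, and `totalNullCounts` are hypothetical stand-ins that avoid any Iceberg dependency, and the sketch assumes each data file can be identified by its path. It shows how `putIfAbsent` keeps one entry per file when several tasks point at the same file, so that per-column null counts are not summed more than once.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctDataFiles {

  // Hypothetical stand-in for a data file: a path plus per-column null counts.
  static final class DataFileStub {
    final String path;
    final Map<Integer, Long> nullCounts;

    DataFileStub(String path, Map<Integer, Long> nullCounts) {
      this.path = path;
      this.nullCounts = nullCounts;
    }
  }

  // Hypothetical stand-in for a scan task: one split of a data file.
  static final class TaskStub {
    final DataFileStub file;
    final long start;

    TaskStub(DataFileStub file, long start) {
      this.file = file;
      this.start = start;
    }
  }

  // Keys the null counts by file path; putIfAbsent keeps the first entry
  // seen for each path and ignores later tasks that reference the same file.
  static Map<String, Map<Integer, Long>> nullCountsByFile(List<TaskStub> tasks) {
    Map<String, Map<Integer, Long>> distinct = new LinkedHashMap<>();
    for (TaskStub task : tasks) {
      distinct.putIfAbsent(task.file.path, task.file.nullCounts);
    }
    return distinct;
  }

  // Sums per-column null counts over the distinct files only.
  static Map<Integer, Long> totalNullCounts(List<TaskStub> tasks) {
    Map<Integer, Long> totals = new HashMap<>();
    for (Map<Integer, Long> perFile : nullCountsByFile(tasks).values()) {
      perFile.forEach((fieldId, count) -> totals.merge(fieldId, count, Long::sum));
    }
    return totals;
  }

  public static void main(String[] args) {
    DataFileStub a = new DataFileStub("a.parquet", Map.of(1, 3L));
    DataFileStub b = new DataFileStub("b.parquet", Map.of(1, 2L));
    // Two tasks are splits of the same file, so a.parquet must count once.
    List<TaskStub> tasks =
        List.of(new TaskStub(a, 0L), new TaskStub(a, 128L), new TaskStub(b, 0L));
    System.out.println(totalNullCounts(tasks)); // prints {1=5}
  }
}
```

Summing over every task in this example would give 8 for column 1 instead of 5, which is the kind of duplication the comment refers to.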