saitharun15 commented on code in PR #11615:
URL: https://github.com/apache/iceberg/pull/11615#discussion_r1854296151
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -194,10 +205,40 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     Map<NamedReference, ColumnStatistics> colStatsMap = Collections.emptyMap();
     if (readConf.reportColumnStats() && cboEnabled) {
       colStatsMap = Maps.newHashMap();
+      Map<Integer, Long> ndvs = Maps.newHashMap();
+      Map<Integer, Long> nullCounts = Maps.newHashMap();
+      Map<Integer, Object> minValues = Maps.newHashMap();
+      Map<Integer, Object> maxValues = Maps.newHashMap();
       List<StatisticsFile> files = table.statisticsFiles();
       if (!files.isEmpty()) {
         List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+        if (readConf.deriveStatsFromManifestSessionConf()
+            || readConf.deriveStatsFromManifestTableProperty()) {
+          Map<String, Map<Integer, Long>> distinctDataFilesNullCount = Maps.newHashMap();
Review Comment:
We found that different FileScanTask objects in the taskGroup pointed to the
same data file, which produced duplicate entries. I used putIfAbsent to ensure
a single entry per file and renamed the maps to distinctDataFiles. However,
after switching to collecting the FileScanTask objects as a set, putIfAbsent
is no longer needed. @RussellSpitzer, can you suggest an alternative
approach?
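
To make the duplicate problem concrete, here is a minimal, self-contained
sketch of deduplicating per-file statistics with putIfAbsent. It does not use
the Iceberg API: SplitTask, its fields, and the file names are hypothetical
stand-ins for tasks that cover different byte ranges of the same data file,
and only a single null count per file is modeled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DistinctDataFilesSketch {

  // Hypothetical stand-in for a scan task: several tasks may cover
  // different byte ranges (splits) of the same data file, and each one
  // carries the null count of the whole file.
  static final class SplitTask {
    final String dataFilePath;
    final long start;
    final long fileNullCount;

    SplitTask(String dataFilePath, long start, long fileNullCount) {
      this.dataFilePath = dataFilePath;
      this.start = start;
      this.fileNullCount = fileNullCount;
    }
  }

  // Keeps one null count per data file path; later tasks that point to an
  // already-seen file are ignored, so the file is counted only once.
  static Map<String, Long> nullCountPerFile(List<SplitTask> tasks) {
    Map<String, Long> distinctDataFiles = new HashMap<>();
    for (SplitTask task : tasks) {
      distinctDataFiles.putIfAbsent(task.dataFilePath, task.fileNullCount);
    }
    return distinctDataFiles;
  }

  // Sums the deduplicated per-file null counts.
  static long totalNullCount(List<SplitTask> tasks) {
    long total = 0L;
    for (long count : nullCountPerFile(tasks).values()) {
      total += count;
    }
    return total;
  }

  public static void main(String[] args) {
    List<SplitTask> tasks = List.of(
        new SplitTask("a.parquet", 0L, 3L),
        new SplitTask("a.parquet", 128L, 3L), // second split of the same file
        new SplitTask("b.parquet", 0L, 5L));
    // A naive sum over all tasks would give 11; keying by path gives 8.
    System.out.println(totalNullCount(tasks));
  }
}
```

In this sketch the two tasks for a.parquet differ by start offset, so the map
is keyed by the file path rather than by the task itself. Whether a set of
task objects alone removes such duplicates depends on how task equality is
defined.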
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]