Re: [PR] Spark : Derive Stats From Manifest on the Fly [iceberg]

via GitHub Fri, 22 Nov 2024 09:02:37 -0800


saitharun15 commented on code in PR #11615:
URL: https://github.com/apache/iceberg/pull/11615#discussion_r1854296151



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -194,10 +205,40 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     Map<NamedReference, ColumnStatistics> colStatsMap = Collections.emptyMap();
     if (readConf.reportColumnStats() && cboEnabled) {
       colStatsMap = Maps.newHashMap();
+      Map<Integer, Long> ndvs = Maps.newHashMap();
+      Map<Integer, Long> nullCounts = Maps.newHashMap();
+      Map<Integer, Object> minValues = Maps.newHashMap();
+      Map<Integer, Object> maxValues = Maps.newHashMap();
       List<StatisticsFile> files = table.statisticsFiles();
       if (!files.isEmpty()) {
         List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
 
+        if (readConf.deriveStatsFromManifestSessionConf()
+            || readConf.deriveStatsFromManifestTableProperty()) {
+          Map<String, Map<Integer, Long>> distinctDataFilesNullCount = Maps.newHashMap();

Review Comment:
   We found that different FileScanTask objects in the taskGroup can point to 
the same data file, which caused that file's stats to be counted more than 
once. I used putIfAbsent to keep a single entry per file and renamed the maps 
with the distinctDataFiles prefix (e.g. distinctDataFilesNullCount). However, 
after switching to collecting the FileScanTask objects into a set, putIfAbsent 
is no longer needed. @RussellSpitzer, can you suggest an alternative approach?
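
For illustration, here is a minimal sketch of the per-file deduplication described above. It is not the code from this PR: `Task` is a hypothetical stand-in for FileScanTask that carries only a data file path and one null count, and the class and method names are made up for the example. It keys on the file path rather than on task equality, on the assumption that two tasks for the same file are not necessarily equal as objects.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctDataFiles {

  // Hypothetical stand-in for a FileScanTask: the path of the data file it
  // reads and a file-level null count for a single column.
  static final class Task {
    final String path;
    final long nullCount;

    Task(String path, long nullCount) {
      this.path = path;
      this.nullCount = nullCount;
    }
  }

  // Keeps the first entry seen for each data file path, so a file referenced
  // by several tasks contributes its file-level stats only once.
  static Map<String, Long> nullCountsByFile(List<Task> tasks) {
    Map<String, Long> byFile = new LinkedHashMap<>();
    for (Task task : tasks) {
      byFile.putIfAbsent(task.path, task.nullCount);
    }
    return byFile;
  }

  // Sums the deduplicated per-file null counts.
  static long totalNullCount(List<Task> tasks) {
    long total = 0L;
    for (long count : nullCountsByFile(tasks).values()) {
      total += count;
    }
    return total;
  }

  public static void main(String[] args) {
    List<Task> tasks = Arrays.asList(
        new Task("a.parquet", 3L),
        new Task("a.parquet", 3L), // second task pointing at the same file
        new Task("b.parquet", 5L));
    System.out.println(totalNullCount(tasks)); // prints 8, not 11
  }
}
```

Keying the map on the file path makes the deduplication independent of how the task objects implement equals and hashCode, which may matter if collecting the tasks into a set does not collapse all tasks for one file.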



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

