Re: [PR] Spark 3.5: Spark action to compute the partition stats [iceberg]

via GitHub Wed, 17 Jan 2024 06:49:47 -0800


ajantha-bhat commented on code in PR #9437:
URL: https://github.com/apache/iceberg/pull/9437#discussion_r1455763573



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java:
##########
@@ -150,6 +154,21 @@ protected Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshotIds) {
     Broadcast<Table> tableBroadcast = sparkContext.broadcast(serializableTable);
     int numShufflePartitions = spark.sessionState().conf().numShufflePartitions();
 
+    return manifestBeanDS(table, snapshotIds, numShufflePartitions)
+        .flatMap(new ReadManifest(tableBroadcast), FileInfo.ENCODER);
+  }
+
+  protected Dataset<PartitionEntryBean> partitionEntryDS(Table table) {
+    Table serializableTable = SerializableTableWithSize.copyOf(table);
+    Broadcast<Table> tableBroadcast = sparkContext.broadcast(serializableTable);
+    int numShufflePartitions = spark.sessionState().conf().numShufflePartitions();
+
+    return manifestBeanDS(table, null, numShufflePartitions)

Review Comment:
   > Is it actually correct? This code would go via ALL_MANIFESTS table. 
Shouldn't we only look for manifests in a particular snapshot for which we 
compute the stats?
   
   True. I confused `snapshot().allManifests()` with the ALL_MANIFESTS metadata 
table. I need to change this. 
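   
   To make the difference concrete, here is a minimal, self-contained sketch. 
It uses a toy model only: `ManifestScopeSketch` and its methods are hypothetical 
names for illustration, not Iceberg classes. It assumes each snapshot references 
its own list of manifests, so a scan across all snapshots can return manifests 
that the snapshot of interest no longer uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical types for illustration, not Iceberg's API.
class ManifestScopeSketch {
  // snapshotId -> manifest paths referenced by that snapshot
  private final Map<Long, List<String>> manifestsBySnapshot = new LinkedHashMap<>();

  void addSnapshot(long snapshotId, List<String> manifests) {
    manifestsBySnapshot.put(snapshotId, new ArrayList<>(manifests));
  }

  // Analogous to reading an "all manifests" view: the union over every snapshot.
  Set<String> manifestsAcrossAllSnapshots() {
    Set<String> result = new LinkedHashSet<>();
    for (List<String> manifests : manifestsBySnapshot.values()) {
      result.addAll(manifests);
    }
    return result;
  }

  // Analogous to asking a single snapshot for the manifests it references.
  Set<String> manifestsForSnapshot(long snapshotId) {
    List<String> manifests = manifestsBySnapshot.get(snapshotId);
    if (manifests == null) {
      throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
    }
    return new LinkedHashSet<>(manifests);
  }

  public static void main(String[] args) {
    ManifestScopeSketch table = new ManifestScopeSketch();
    table.addSnapshot(1L, List.of("m1.avro", "m2.avro"));
    // Snapshot 2 no longer references m1.avro.
    table.addSnapshot(2L, List.of("m2.avro", "m3.avro"));

    System.out.println("all snapshots: " + table.manifestsAcrossAllSnapshots());
    System.out.println("snapshot 2:    " + table.manifestsForSnapshot(2L));
  }
}
```

   In this toy model, stats computed for snapshot 2 from the union would wrongly 
include entries from `m1.avro`, which is the concern raised in the quoted 
question.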
   
   Also, thanks for the detailed distributed and local algorithms. In my 
distributed algorithm, I ran into a problem serializing `partitionData` (an Avro 
class issue), which is why I had to keep most of the logic on the driver. 
   
   I am not yet sure how to implement the distributed algorithm that 
you have suggested. I will explore it.
   
   In the meantime, you can also review 
https://github.com/apache/iceberg/pull/9170 (which is independent and a 
prerequisite for this PR).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


