ajantha-bhat commented on code in PR #9437:
URL: https://github.com/apache/iceberg/pull/9437#discussion_r1455763573
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java:
##########
@@ -150,6 +154,21 @@ protected Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshotIds) {
     Broadcast<Table> tableBroadcast = sparkContext.broadcast(serializableTable);
     int numShufflePartitions = spark.sessionState().conf().numShufflePartitions();

+    return manifestBeanDS(table, snapshotIds, numShufflePartitions)
+        .flatMap(new ReadManifest(tableBroadcast), FileInfo.ENCODER);
+  }
+
+  protected Dataset<PartitionEntryBean> partitionEntryDS(Table table) {
+    Table serializableTable = SerializableTableWithSize.copyOf(table);
+    Broadcast<Table> tableBroadcast = sparkContext.broadcast(serializableTable);
+    int numShufflePartitions = spark.sessionState().conf().numShufflePartitions();
+
+    return manifestBeanDS(table, null, numShufflePartitions)

Review Comment:
   > Is it actually correct? This code would go via ALL_MANIFESTS table. Shouldn't we only look for manifests in a particular snapshot for which we compute the stats?

   True. I confused `snapshot().allManifests()` with the ALL_MANIFESTS table. I need to change this.

   And thanks for the detailed distributed and local algorithms. In my distributed algorithm, I ran into a problem serializing `partitionData` (an Avro class issue), which is why I had to keep most of the logic on the driver. I am not yet sure how to implement the distributed algorithm you suggested; I will explore that.

   In the meantime, you can also review https://github.com/apache/iceberg/pull/9170 (which is independent and a prerequisite for this PR).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@iceberg.apache.org