Re: [PR] Use Snapshot's statistics file in SparkScan [iceberg]

via GitHub Wed, 27 Nov 2024 16:11:04 -0800


amogh-jahagirdar commented on code in PR #11040:
URL: https://github.com/apache/iceberg/pull/11040#discussion_r1861360140



##########
api/src/main/java/org/apache/iceberg/Table.java:
##########
@@ -373,4 +374,14 @@ default Snapshot snapshot(String name) {
 
     return null;
   }
+
+  /**
+   * Returns the statistics file for the given snapshot id, if available.
+   *
+   * @return the {@link StatisticsFile} for the given snapshot id, if 
available.
+   */
+  default Optional<StatisticsFile> statistics(long snapshotId) {

Review Comment:
   Thanks for adding this, I think this API is in the right direction but I 
wonder if it's taking enough information to be able to resolve the right file. 
There's an implicit assumption in the current implementation of this API that a 
Puffin file will have all the blobs needed. I think it's possible that there 
could multiple puffins, and one puffin has blob type NDV and the other puffin 
has blob type SomeFutureIndex. both of these puffin files are for snapshot 1.
   
   In the current implementation if someone wants the Puffin where there's 
SomeFutureIndex, but we happen to return the Puffin due to the generic "Find me 
a puffin with the given snapshot logic" I feel like the API isn't really doing 
what it needs.
   
   Here's the signature I'm thinking at the moment:
   
   ```
   default StatisticsFile statisticsFor(long snapshotId, String blobType)
   ```
   This will attempt to attempt to find the statistics file produced at the 
given snapshot which contains the blob type. I think this future proofs the API 
a bit more and makes it more useful in case a user is really only caring about 
a particular blob type which may or may not exist in some other file.
   
   I also think that in case a statistics file for exactly that snapshot 
couldn't be found, we should return the latest statistics rather than just 
return nothing. While it's possible that the statistics are not completely 
accurate in this approach, I think in the average case data distributions 
wouldn't change so drastically between snapshots that the statistics would work 
horribly against the query. It's probably better in the average case to have 
some statistics. If there's no there's no statistics file containing the 
desired blob then I think we'd just return null.
   
   WDYT?



##########
api/src/main/java/org/apache/iceberg/Table.java:
##########
@@ -373,4 +374,14 @@ default Snapshot snapshot(String name) {
 
     return null;
   }
+
+  /**
+   * Returns the statistics file for the given snapshot id, if available.
+   *
+   * @return the {@link StatisticsFile} for the given snapshot id, if 
available.
+   */
+  default Optional<StatisticsFile> statistics(long snapshotId) {

Review Comment:
   Thanks for adding this, I think this API is in the right direction but I 
wonder if it's taking enough information to be able to resolve the right file. 
There's an implicit assumption in the current implementation of this API that a 
Puffin file will have all the blobs needed. I think it's possible that there 
could multiple puffins, and one puffin has blob type NDV and the other puffin 
has blob type SomeFutureIndex. both of these puffin files are for snapshot 1.
   
   In the current implementation if someone wants the Puffin where there's 
SomeFutureIndex, but we happen to return the Puffin due to the generic "Find me 
a puffin with the given snapshot logic" I feel like the API isn't really doing 
what it needs.
   
   Here's the signature I'm thinking at the moment:
   
   ```
   default StatisticsFile statisticsFor(long snapshotId, String blobType)
   ```
   This will attempt to attempt to find the statistics file produced at the 
given snapshot which contains the blob type. I think this future proofs the API 
a bit more and makes it more useful in case a user is really only caring about 
a particular blob type which may or may not exist in some other Puffin file.
   
   I also think that in case a statistics file for exactly that snapshot 
couldn't be found, we should return the latest statistics rather than just 
return nothing. While it's possible that the statistics are not completely 
accurate in this approach, I think in the average case data distributions 
wouldn't change so drastically between snapshots that the statistics would work 
horribly against the query. It's probably better in the average case to have 
some statistics. If there's no there's no statistics file containing the 
desired blob then I think we'd just return null.
   
   WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Use Snapshot's statistics file in SparkScan [iceberg]

Reply via email to