aokolnychyi commented on PR #8123:
URL: https://github.com/apache/iceberg/pull/8123#issuecomment-1645972945

   @zinking, I know that paper and a few of its ideas may be applicable to us. At the same time, Iceberg metadata already forms a set of system tables that we query in a distributed manner (`data_files`, etc.), which is similar to having a separate table for metadata. If I remember correctly, one benefit of BigQuery is that it does not have to bring the results back while doing this distributed planning, but it is on the Spark side to provide that functionality. In my view, it is unlikely that one would query 20 PB of data in a single job, and when that happens, spending 30 seconds planning the job is unlikely to be a problem.
   
   At this point, we are not storing large blobs in the manifests, so we will come back to that paper when we discuss how to support secondary indexes and how to integrate them into planning. That is the point at which the metadata would become too big. Right now, we are talking about 2-4 GB of metadata to cover 20-40 PB of data.
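   As a back-of-the-envelope check of the figures above (the constants below are just the numbers quoted in this comment, not Iceberg internals), the metadata-to-data ratio works out to roughly a tenth of a megabyte of metadata per terabyte of data:

   ```python
   # Sanity-check the quoted ratio: 2-4 GB of metadata covering 20-40 PB of data.
   GB = 1024**3
   TB = 1024**4
   PB = 1024**5

   metadata_bytes = 2 * GB   # low end of the quoted metadata size
   data_bytes = 20 * PB      # low end of the quoted data size

   ratio = metadata_bytes / data_bytes
   print(f"metadata/data ratio: {ratio:.1e}")                   # ~9.5e-08
   print(f"metadata per TB of data: {ratio * TB / 1024**2:.1f} MB")  # 0.1 MB
   ```

   At that ratio, even a full scan of the metadata for a job touching the entire table stays in the low gigabytes, which is why the 30-second planning estimate is plausible.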


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

