jhump commented on issue #386:
URL: https://github.com/apache/iceberg-go/issues/386#issuecomment-2847978989

   > Both the ManifestFile and the ManifestEntry would contain the SnapshotID 
from the point when they were added to the table, so you should be able to 
retrieve the Schema ID and Schema by looking at the corresponding Snapshot to 
that ID.
   
   This is not quite right. The schema ID for a snapshot is not necessarily the 
schema of every data file in it. It is my understanding that you need to know the 
schema used to create the file to correctly interpret the data therein and to 
validate that the parquet schema matches the expected schema (the parquet 
schema, of course, should match that corresponding schema ID, _not_ the current 
or latest schema, which may have evolved since).
   
   > If we were to embed the Schema information into the manifest 
entry/manifest file, consumers might mistakenly believe it shows the current 
schema of a table as opposed to just reflecting the schema at the time the 
manifest was written which may or may not be the same as the current one.
   
   I don't know why that would be the case. It can be clearly documented that 
it's the schema that was used to create that manifest and all data files 
therein. I certainly would never expect it to be the same as the current 
schema, since the schema can be continually evolving and changing over time.
   
   > In the interests of avoiding this possible ambiguity, can you expand more 
on why you want to expose the schema directly through the ManifestFile and 
ManifestEntry as opposed to having to go through the Snapshot to get them?
   
   You can't go through the snapshot to get them. That will only tell you the 
schema ID of any _new_ files that were added in that snapshot. I need to know 
the schema for each particular data file.
   
   I need to know this to implement efficient removal of data files, for 
enforcing a retention policy. My application has its own metadata about each 
parquet file it generates (because it actually uses the parquet files as its 
own durable store of records, independent of the Iceberg table). So it knows 
the schema and partition spec ID of each data file. So if I also knew the 
schema and partition spec ID for a particular manifest, I could quickly decide 
to skip it -- I know the data file to be removed does not appear in that 
manifest if its schema ID and partition spec ID do not match. As you pointed out, 
the partition spec ID can already be had from the `ManifestFile` struct that is 
in the manifest list; but the schema ID is only present in the manifest file's 
metadata. So having access to this makes it easier to generate a new snapshot 
with those data files removed because there may be fewer manifests to scan when 
determining which manifests need to be re-written.
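
   A rough sketch of the skipping logic described above. The types here 
(`manifestInfo`, `fileKey`) and the function `candidateManifests` are 
hypothetical stand-ins, not iceberg-go API; the sketch assumes the schema ID 
and partition spec ID of each manifest are already available without reading 
its entries.

```go
package main

import "fmt"

// manifestInfo is a hypothetical per-manifest summary (not an iceberg-go type).
type manifestInfo struct {
	Path            string
	SchemaID        int
	PartitionSpecID int
}

// fileKey is what the application already knows about each data file it
// wants to remove.
type fileKey struct {
	SchemaID        int
	PartitionSpecID int
}

// candidateManifests returns only the manifests whose schema ID and partition
// spec ID match at least one file to remove. All other manifests can be
// carried over to the new snapshot without being scanned.
func candidateManifests(manifests []manifestInfo, toRemove []fileKey) []manifestInfo {
	wanted := make(map[fileKey]struct{}, len(toRemove))
	for _, k := range toRemove {
		wanted[k] = struct{}{}
	}
	var out []manifestInfo
	for _, m := range manifests {
		key := fileKey{SchemaID: m.SchemaID, PartitionSpecID: m.PartitionSpecID}
		if _, ok := wanted[key]; ok {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	manifests := []manifestInfo{
		{Path: "m1.avro", SchemaID: 1, PartitionSpecID: 0},
		{Path: "m2.avro", SchemaID: 2, PartitionSpecID: 0},
		{Path: "m3.avro", SchemaID: 2, PartitionSpecID: 1},
	}
	toRemove := []fileKey{{SchemaID: 2, PartitionSpecID: 0}}
	for _, m := range candidateManifests(manifests, toRemove) {
		fmt.Println("must scan:", m.Path)
	}
}
```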
   
   Another task where this would be useful is to implement small file merging. 
It is not necessarily safe (since columns can be added and deleted as part of 
schema evolution) and certainly not efficient to merge parquet files that use 
different schemas. So having access to the schema ID in the manifest would make 
it easy to group data files by like schema. In my case, I will also be grouping 
by like partition spec, because a different partition spec could mean 
having to re-partition the data, which could mean breaking it up into 
multiple new files, whereas the intent is to go the other direction.
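
   To illustrate the grouping, here is a minimal sketch. `dataFile`, 
`groupKey`, `groupForMerge`, and the size threshold are all hypothetical names 
chosen for this example; none of them come from iceberg-go.

```go
package main

import "fmt"

// dataFile is a hypothetical record of what is known about one data file.
type dataFile struct {
	Path            string
	SchemaID        int
	PartitionSpecID int
	SizeBytes       int64
}

// groupKey identifies files that are safe to merge with one another.
type groupKey struct {
	SchemaID        int
	PartitionSpecID int
}

// groupForMerge buckets files smaller than smallThreshold by schema ID and
// partition spec ID. Buckets with fewer than two files are dropped because
// there is nothing to merge them with.
func groupForMerge(files []dataFile, smallThreshold int64) map[groupKey][]dataFile {
	groups := make(map[groupKey][]dataFile)
	for _, f := range files {
		if f.SizeBytes >= smallThreshold {
			continue // already large enough; leave it alone
		}
		k := groupKey{SchemaID: f.SchemaID, PartitionSpecID: f.PartitionSpecID}
		groups[k] = append(groups[k], f)
	}
	for k, g := range groups {
		if len(g) < 2 {
			delete(groups, k)
		}
	}
	return groups
}

func main() {
	files := []dataFile{
		{Path: "a.parquet", SchemaID: 1, PartitionSpecID: 0, SizeBytes: 10},
		{Path: "b.parquet", SchemaID: 1, PartitionSpecID: 0, SizeBytes: 20},
		{Path: "c.parquet", SchemaID: 2, PartitionSpecID: 0, SizeBytes: 15},
		{Path: "d.parquet", SchemaID: 1, PartitionSpecID: 0, SizeBytes: 5000},
	}
	for k, g := range groupForMerge(files, 1000) {
		fmt.Printf("schema %d, spec %d: %d files to merge\n",
			k.SchemaID, k.PartitionSpecID, len(g))
	}
}
```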
   
   > A question, is there ever a scenario where you'd want to read the metadata 
file without also reading the entries?
   
   If the above is compelling enough to continue with exposing this, it would be ideal 
if the properties could be extracted prior to reading the entries, mainly to 
avoid some I/O processing when it is determined that the file's schema ID means 
the processor can ignore all of its entries.
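
   A sketch of the access pattern I have in mind, under the assumption that 
the schema ID can be obtained before any entries are decoded. `lazyManifest` 
and `entriesForSchema` are hypothetical and only model the idea; they say 
nothing about how iceberg-go would actually expose this.

```go
package main

import "fmt"

// lazyManifest is a hypothetical handle: SchemaID is known up front, while
// ReadEntries performs the (expensive) work of decoding the entries.
type lazyManifest struct {
	SchemaID    int
	ReadEntries func() []string
}

// entriesForSchema decodes entries only for manifests with a matching schema
// ID; every other manifest is skipped without calling ReadEntries.
func entriesForSchema(manifests []lazyManifest, schemaID int) []string {
	var paths []string
	for _, m := range manifests {
		if m.SchemaID != schemaID {
			continue // no entry I/O for this manifest
		}
		paths = append(paths, m.ReadEntries()...)
	}
	return paths
}

func main() {
	reads := 0
	open := func(id int, paths ...string) lazyManifest {
		return lazyManifest{SchemaID: id, ReadEntries: func() []string {
			reads++
			return paths
		}}
	}
	manifests := []lazyManifest{
		open(1, "a.parquet"),
		open(2, "b.parquet", "c.parquet"),
		open(1, "d.parquet"),
	}
	paths := entriesForSchema(manifests, 2)
	fmt.Println("entries:", paths, "manifests decoded:", reads)
}
```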


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

