jhump commented on issue #386: URL: https://github.com/apache/iceberg-go/issues/386#issuecomment-2847978989
> Both the ManifestFile and the ManifestEntry would contain the SnapshotID from the point when they were added to the table, so you should be able to retrieve the Schema ID and Schema by looking at the corresponding Snapshot to that ID.

This is not quite right. The schema ID for a snapshot is not necessarily the schema used by all of its data files. It is my understanding that you need to know the schema used to create a file in order to correctly interpret the data therein and to validate that the parquet schema matches the expected schema. (The parquet schema, of course, should match the schema for that schema ID, _not_ the current or latest schema, which may have evolved since.)

> If we were to embed the Schema information into the manifest entry/manifest file, consumers might mistakenly believe it shows the current schema of a table as opposed to just reflecting the schema at the time the manifest was written which may or may not be the same as the current one.

I don't know why that would be the case. It can be clearly documented that it's the schema that was used to create that manifest and all data files therein. I certainly would never expect it to be the same as the current schema, since the schema can keep evolving over time.

> In the interests of avoiding this possible ambiguity, can you expand more on why you want to expose the schema directly through the ManifestFile and ManifestEntry as opposed to having to go through the Snapshot to get them?

You can't go through the snapshot to get them. That will only tell you the schema ID of any _new_ files that were added in that snapshot. I need to know the schema for each particular data file.

I need to know this to implement efficient removal of data files, for enforcing a retention policy. My application has its own metadata about each parquet file it generates (because it actually uses the parquet files as its own durable store of records, independent of the Iceberg table). So it knows the schema and partition spec ID of each data file. If I also knew the schema and partition spec ID for a particular manifest, I could quickly decide to skip it -- I know the data file to be removed does not appear in that manifest if its schema and partition spec ID do not match. As you pointed out, the partition spec ID can already be had from the `ManifestFile` struct that is in the manifest list; but the schema ID is only present in the manifest file's metadata. So having access to this makes it easier to generate a new snapshot with those data files removed, because there may be fewer manifests to scan when determining which manifests need to be re-written.

Another task where this would be useful is implementing small file merging. It is not necessarily safe (since columns can be added and deleted as part of schema evolution) and certainly not efficient to merge parquet files that use different schemas. So having access to the schema ID in the manifest would make it easy to group data files by like schema. In my case, I will also be grouping by like partition spec, because a different partition spec could mean having to re-partition the data, which could mean breaking it up into multiple new files, whereas the intent is to go the other direction.

> A question, is there ever a scenario where you'd want to read the metadata file without also reading the entries?

If the above is compelling enough to continue with exposing this, it would be ideal if the properties could be extracted prior to reading the entries, mainly to avoid some I/O processing when it is determined that the file's schema ID means the processor can ignore all of its entries.
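
To make the skip check and the merge grouping described above concrete, here is a rough Go sketch. The types `fileKey`, `manifestInfo`, and `dataFileInfo` and both functions are hypothetical stand-ins invented for illustration; they are not part of the iceberg-go API, and the sketch assumes the manifest's schema ID has already been obtained somehow (which is exactly what this issue asks for).

```go
package main

import "fmt"

// fileKey pairs the schema ID and partition spec ID a file was written with.
// Hypothetical type, for illustration only.
type fileKey struct {
	SchemaID int
	SpecID   int
}

// manifestInfo is a hypothetical stand-in for manifest-level metadata.
// It is NOT the iceberg-go ManifestFile type.
type manifestInfo struct {
	Path string
	Key  fileKey
}

// dataFileInfo is hypothetical application-side metadata about one data file.
type dataFileInfo struct {
	Path string
	Key  fileKey
}

// manifestsToScan returns only the manifests whose (schema ID, spec ID) pair
// matches at least one data file slated for removal. Every other manifest
// cannot contain those files, so its entries never need to be read.
func manifestsToScan(manifests []manifestInfo, toRemove []dataFileInfo) []manifestInfo {
	wanted := make(map[fileKey]struct{}, len(toRemove))
	for _, f := range toRemove {
		wanted[f.Key] = struct{}{}
	}
	var out []manifestInfo
	for _, m := range manifests {
		if _, ok := wanted[m.Key]; ok {
			out = append(out, m)
		}
	}
	return out
}

// groupForMerge buckets data file paths by (schema ID, spec ID) so that only
// files written with the same schema and partition spec are merged together.
func groupForMerge(files []dataFileInfo) map[fileKey][]string {
	groups := make(map[fileKey][]string)
	for _, f := range files {
		groups[f.Key] = append(groups[f.Key], f.Path)
	}
	return groups
}

func main() {
	manifests := []manifestInfo{
		{Path: "m1.avro", Key: fileKey{SchemaID: 1, SpecID: 0}},
		{Path: "m2.avro", Key: fileKey{SchemaID: 2, SpecID: 0}},
		{Path: "m3.avro", Key: fileKey{SchemaID: 2, SpecID: 1}},
	}
	toRemove := []dataFileInfo{
		{Path: "old.parquet", Key: fileKey{SchemaID: 2, SpecID: 0}},
	}
	for _, m := range manifestsToScan(manifests, toRemove) {
		fmt.Println("must scan:", m.Path)
	}
}
```

The same key is used for both tasks on purpose: a manifest can be skipped for removal, and a set of files is a merge candidate, based on nothing more than the schema ID and partition spec ID.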
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org