zeroshade commented on issue #386: URL: https://github.com/apache/iceberg-go/issues/386#issuecomment-2848072497
> This is not quite right. The schema ID for a snapshot is not necessarily the schema in all data files. It is my understanding that you need to know the schema used to create the file to correctly interpret the data therein and to validate that the parquet schema matches the expected schema (the parquet schema, of course, should match that corresponding schema ID, not the current or latest schema, which may have evolved since). You don't need to know the schema used to create the file to correctly interpret the data. Columns are matched via field-id, when evolving a schema field-ids should never be re-used. If the types don't match, there are specific rules for how to promote data types for iceberg and when an error should be encountered. I can't think of any scenario in which you need to know what the table's schema was at the time the data file was written in order to interpret the data therein. > You can't go through the snapshot to get them. That will only tell you the schema ID of any new files that were added in that snapshot. I need to know the schema for each particular data file. I'm confused, if you want the schema of the data file you just open the data file and check its schema. Iceberg doesn't store the *data file's* schema anywhere but in the file itself. The Avro metadata you were referring to in the issue is the *table* schema, which is what you would get by looking at the Snapshot when the file was added. > So if I also knew the schema and partition spec ID for a particular manifest, I could quickly decide to skip it -- I know the data file to be removed does not appear in that manifest if it's schema and partition spec ID do not match. If you read the a ManifestList, you'll have the partition spec ID for each manifest as they would always match the Spec ID used to write the ManifestList, also the ManifestFile's themselves also know the spec ID that was used to write them. But for schema IDs, data files can be brought forward and reused (marked as "existing") and moved between manifests (compaction causing rewriting the ManifestFile but that particular data file was unchanged in the operation). The schema ID when a particular data file was written does not necessarily have any relation whatsoever with the schema ID of the table when the *manifest* that contains it was written. Consider the following scenario: 1. Schema ID = 1, data files A and B are written. Collected in Manifest X (metadata Schema ID = 1) 2. New Column is added to the table and data file C is written. Schema ID = 2, Manifest X is unchanged, Manifest Y is written for data file C (metadata Schema ID = 2) 3. We perform compaction of Manifests, and combine the records from B and C into a new data file, D. This is our new state: * DataFiles: A, D * Manifests: Z (Schema ID = 2, contains entries for A and D) Even though Data File A was written with Schema ID = 1, it's referenced by a Manifest whose metadata says "schema ID = 2". If you were to skip it by assuming that there are no files in it with "schema ID == 1" you would miss the file. (@kevinjqliu @Fokko please correct me if any of the above is wrong or invalid WRT how iceberg works) Essentially, Schema ID is not a good way to connect things for this. > So having access to this makes it easier to generate a new snapshot with those data files removed because there may be fewer manifests to scan when determining which manifests need to be re-written. Unfortunately, the only reliable way to be sure that a data file is not in a particular manifest is either by the Partition Spec ID or by the Data File Path. This is part of why Iceberg v4 will potentially allow ways to condense ManifestLists and ManifestFiles to reduce the number of files that have to be scanned. The schema ID in the metadata for a ManifestFile is simply not a reliable way of telling what the schema of the underlying data files are. Only what they should be expected to be *compatible* with. > Another task where this would be useful is to implement small file merging. It is not necessarily safe (since columns can be added and deleted as part of schema evolution) and certainly not efficient to merge parquet files that use different schemas. So having access to the schema ID in the manifest would make it easy to group data files by like schema. The table Schema ID at the time the data file was added does not necessarily tell you what the schema of the actual data file is. Only that the schema of the data file is *compatible* with the schema ID at the time of being added to the table. For example, the Fast Append operation where you simply add existing data files to a table does not mean that extra fields can't exist in the file, nor does it mean that fields which aren't required in the schema can't be missing. The only way to know the schema of the data file is to check the data file itself. @Fokko @kevinjqliu Could you guys please double check everything I said above to make sure I'm not missing something or misrepresenting something? Just wanna make sure that I'm properly understanding things here. > If the above is compelling to continue with exposing this, it would be ideal if the properties could be extracted prior to reading the entries, mainly to avoid some I/O processing when it is determined that the file's schema ID means the processor can ignore all of its entries. Despite everything above, I think it would be perfectly reasonable to add a method to `ManifestFile` which just reads and returns the Avro Metadata for the given manifest file. :smile: My primary objection honestly is simply that I think it would be misleading and dangerous to connect a particular schema ID directly to the `ManifestEntry` itself since it's not *really* representative of anything but the table schema at the time a particular *manifest* is written. But simply exposing the fields of the Avro Metadata for a given `ManifestFile` is a perfectly reasonable thing I think. It also happens to be pretty simple too :smile: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org