zeroshade commented on issue #386:
URL: https://github.com/apache/iceberg-go/issues/386#issuecomment-2848072497

   > This is not quite right. The schema ID for a snapshot is not necessarily 
the schema in all data files. It is my understanding that you need to know the 
schema used to create the file to correctly interpret the data therein and to 
validate that the parquet schema matches the expected schema (the parquet 
schema, of course, should match that corresponding schema ID, not the current 
or latest schema, which may have evolved since).
   
   You don't need to know the schema used to create the file to correctly interpret the data. Columns are matched via field ID, and field IDs should never be re-used when evolving a schema. If the types don't match, Iceberg has specific rules for which type promotions are allowed and when an error should be raised. I can't think of any scenario in which you need to know what the table's schema was at the time the data file was written in order to interpret the data therein.
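   To make the field-ID matching concrete, here is a minimal sketch of the idea (plain Go structs purely for illustration, not the iceberg-go API): the reader resolves the file's columns against the current table schema by field ID alone, so renames, reordering, and newly added columns all fall out naturally.

```go
package main

import "fmt"

// Column is a hypothetical, simplified column descriptor used only for this sketch.
type Column struct {
	FieldID int
	Name    string
	Type    string
}

// resolveByFieldID maps every column the table schema asks for to the matching
// column in the data file, keyed purely on field ID. A column added to the
// table after the file was written resolves to nil and is read as null/default.
func resolveByFieldID(tableCols, fileCols []Column) map[int]*Column {
	byID := make(map[int]*Column, len(fileCols))
	for i := range fileCols {
		byID[fileCols[i].FieldID] = &fileCols[i]
	}
	resolved := make(map[int]*Column, len(tableCols))
	for _, tc := range tableCols {
		resolved[tc.FieldID] = byID[tc.FieldID]
	}
	return resolved
}

func main() {
	// File written before "ts" was renamed to "event_ts" and before "new_col" existed.
	fileCols := []Column{{1, "id", "int"}, {2, "ts", "timestamp"}}
	tableCols := []Column{{1, "id", "long"}, {2, "event_ts", "timestamp"}, {3, "new_col", "string"}}

	resolved := resolveByFieldID(tableCols, fileCols)
	for _, tc := range tableCols {
		if fc := resolved[tc.FieldID]; fc != nil {
			fmt.Printf("field %d (%s): read file column %q, promoting %s -> %s if allowed\n",
				tc.FieldID, tc.Name, fc.Name, fc.Type, tc.Type)
		} else {
			fmt.Printf("field %d (%s): not present in file, read as null/default\n", tc.FieldID, tc.Name)
		}
	}
}
```

   (The `int -> long` case above is one of the promotions Iceberg explicitly allows; anything outside the allowed set is where an error gets raised.)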
   
   > You can't go through the snapshot to get them. That will only tell you the 
schema ID of any new files that were added in that snapshot. I need to know the 
schema for each particular data file.
   
   I'm confused; if you want the schema of the data file, you just open the data file and check its schema. Iceberg doesn't store the *data file's* schema anywhere but in the file itself. The Avro metadata you were referring to in the issue is the *table* schema, which is what you would get by looking at the Snapshot when the file was added.
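   Reading a data file's own schema is just a footer read, which is cheap. A minimal sketch, using the parquet-go/parquet-go library and an illustrative path (any Parquet reader that exposes the footer schema works the same way):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/parquet-go/parquet-go"
)

func main() {
	// Illustrative path: any Parquet data file referenced by the table.
	f, err := os.Open("data/00000-0-example.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// OpenFile reads the file metadata (footer), not the row data.
	pf, err := parquet.OpenFile(f, info.Size())
	if err != nil {
		log.Fatal(err)
	}

	// The schema the file was actually written with.
	fmt.Println(pf.Schema())
}
```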
   
   >  So if I also knew the schema and partition spec ID for a particular 
manifest, I could quickly decide to skip it -- I know the data file to be 
removed does not appear in that manifest if it's schema and partition spec ID 
do not match. 
   
   If you read the ManifestList, you'll already have the partition spec ID for each manifest (it will always match the spec ID used to write the ManifestList), and the ManifestFiles themselves also record the spec ID that was used to write them.
   
   But for schema IDs, data files can be brought forward and reused (marked as "existing") and moved between manifests (compaction can rewrite a ManifestFile while leaving a particular data file untouched). The schema ID in effect when a particular data file was written does not necessarily have any relation whatsoever to the schema ID of the table when the *manifest* that contains it was written.
   
   Consider the following scenario:
   
   1. Schema ID = 1, data files A and B are written. Collected in Manifest X 
(metadata Schema ID = 1)
   2. A new column is added to the table and data file C is written under Schema ID = 2. Manifest X is unchanged; Manifest Y is written for data file C (metadata Schema ID = 2)
   3. We perform compaction of Manifests, and combine the records from B and C 
into a new data file, D. This is our new state:
       * DataFiles: A, D
       * Manifests: Z (Schema ID = 2, contains entries for A and D)
   
   Even though data file A was written while Schema ID = 1 was current, it's referenced by a manifest whose metadata says "schema ID = 2". If you skipped that manifest on the assumption that it contains no files with "schema ID == 1", you would miss the file.
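   Spelled out as code, the tempting-but-broken filter is something like this toy sketch (hypothetical structs that just restate the scenario above):

```go
package main

import "fmt"

// manifest is a hypothetical, simplified record mirroring the scenario above.
type manifest struct {
	name          string
	tableSchemaID int      // schema ID recorded in the manifest's Avro metadata
	dataFiles     []string // data files the manifest references
}

func main() {
	// State after step 3: Manifest Z was written under table schema ID 2,
	// yet it still references data file A, which was written under schema ID 1.
	manifests := []manifest{{name: "Z", tableSchemaID: 2, dataFiles: []string{"A", "D"}}}

	// Goal: find every manifest that might reference a file written with schema ID 1.
	const targetSchemaID = 1
	for _, m := range manifests {
		if m.tableSchemaID != targetSchemaID {
			continue // UNSAFE: this skips Z, and with it data file A.
		}
		fmt.Println("scanning manifest", m.name, m.dataFiles)
	}
	fmt.Println("done: data file A was never considered")
}
```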
   
   (@kevinjqliu @Fokko please correct me if any of the above is wrong or 
invalid WRT how iceberg works)
   
   Essentially, Schema ID is not a good way to connect things for this.
   
   >  So having access to this makes it easier to generate a new snapshot with 
those data files removed because there may be fewer manifests to scan when 
determining which manifests need to be re-written.
   
   Unfortunately, the only reliable ways to be sure that a data file is not in a particular manifest are the partition spec ID and the data file path. This is part of why Iceberg v4 will potentially allow ways to condense ManifestLists and ManifestFiles to reduce the number of files that have to be scanned. The schema ID in a ManifestFile's metadata is simply not a reliable way of telling what the schema of the underlying data files is, only what they should be expected to be *compatible* with.
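   A rough sketch of what the reliable version of that pruning looks like, using hypothetical stand-in types rather than the actual iceberg-go manifest structures: prune candidate manifests by partition spec ID, then confirm membership by data file path in each entry.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for a manifest and its entries.
type entry struct{ dataFilePath string }

type manifest struct {
	path    string
	specID  int
	entries []entry // in reality only known after reading the manifest itself
}

// manifestsToRewrite returns the manifests that actually reference the data
// file being removed. The partition spec ID is only a pruning filter; the
// data file path is the real membership test.
func manifestsToRewrite(manifests []manifest, removePath string, removeSpecID int) []string {
	var out []string
	for _, m := range manifests {
		if m.specID != removeSpecID {
			continue // safe to skip: written under a different partition spec
		}
		for _, e := range m.entries {
			if e.dataFilePath == removePath {
				out = append(out, m.path)
				break
			}
		}
	}
	return out
}

func main() {
	ms := []manifest{
		{path: "m1.avro", specID: 0, entries: []entry{{"s3://bucket/data/a.parquet"}}},
		{path: "m2.avro", specID: 1, entries: []entry{{"s3://bucket/data/b.parquet"}}},
	}
	fmt.Println(manifestsToRewrite(ms, "s3://bucket/data/b.parquet", 1)) // [m2.avro]
}
```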
   
   > Another task where this would be useful is to implement small file 
merging. It is not necessarily safe (since columns can be added and deleted as 
part of schema evolution) and certainly not efficient to merge parquet files 
that use different schemas. So having access to the schema ID in the manifest 
would make it easy to group data files by like schema.
   
   The table schema ID at the time a data file was added does not necessarily tell you what the schema of the actual data file is, only that the data file's schema is *compatible* with the schema ID at the time of being added to the table. For example, with a Fast Append operation, where you simply add existing data files to a table, extra fields can still exist in the file, and fields that aren't required by the schema can be missing. The only way to know the schema of a data file is to check the data file itself.
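   That means for small-file merging, the grouping key has to come from the files themselves. A rough sketch, again using parquet-go/parquet-go and illustrative paths, that buckets candidate files by the schema recorded in their own footers:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/parquet-go/parquet-go"
)

// footerSchema returns the schema string recorded in the file's own footer.
func footerSchema(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return "", err
	}
	pf, err := parquet.OpenFile(f, info.Size())
	if err != nil {
		return "", err
	}
	return pf.Schema().String(), nil
}

func main() {
	// Illustrative paths: the small files being considered for merging.
	candidates := []string{"data/f1.parquet", "data/f2.parquet", "data/f3.parquet"}

	// Only files whose footers describe the same schema land in the same group.
	groups := make(map[string][]string)
	for _, p := range candidates {
		s, err := footerSchema(p)
		if err != nil {
			log.Fatal(err)
		}
		groups[s] = append(groups[s], p)
	}
	for _, files := range groups {
		fmt.Println("merge group:", files)
	}
}
```

   Comparing textual schemas is blunt (comparing field IDs and types would be more precise), but the point stands: the grouping signal comes from the files, not from the table-level schema ID.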
   
   @Fokko @kevinjqliu Could you guys please double check everything I said 
above to make sure I'm not missing something or misrepresenting something? Just 
wanna make sure that I'm properly understanding things here.
   
   > If the above is compelling to continue with exposing this, it would be 
ideal if the properties could be extracted prior to reading the entries, mainly 
to avoid some I/O processing when it is determined that the file's schema ID 
means the processor can ignore all of its entries.
   
   Despite everything above, I think it would be perfectly reasonable to add a method to `ManifestFile` which just reads and returns the Avro metadata for the given manifest file. :smile: My primary objection, honestly, is that it would be misleading and dangerous to connect a particular schema ID directly to the `ManifestEntry` itself, since it isn't *really* representative of anything but the table schema at the time a particular *manifest* is written. But simply exposing the fields of the Avro metadata for a given `ManifestFile` is perfectly reasonable, and it also happens to be pretty simple. :smile:
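   To illustrate what such a method would surface, here's a sketch that just dumps the key/value metadata from a manifest file's Avro header. It uses the hamba/avro OCF decoder and an illustrative path; treat the exact accessor as an assumption, since whatever Avro reader iceberg-go already uses internally would expose the same header map.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hamba/avro/v2/ocf"
)

func main() {
	// Illustrative path: a manifest file taken from the table's metadata location.
	f, err := os.Open("metadata/some-manifest-file.avro")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Only the OCF header is consumed here; no manifest entries are decoded.
	dec, err := ocf.NewDecoder(f)
	if err != nil {
		log.Fatal(err)
	}

	// The header's key/value metadata; for Iceberg manifests this is where
	// keys like "schema", "schema-id", "partition-spec-id", and "format-version" live.
	for k, v := range dec.Metadata() {
		fmt.Printf("%s = %s\n", k, v)
	}
}
```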

