Sl1mb0 opened a new issue, #778:
URL: https://github.com/apache/iceberg-rust/issues/778

   At the moment, building and serializing Iceberg metadata are tightly coupled.
   
   For example, let's say I want to build a `ManifestFile` that I then add to a 
`ManifestList`:
   
   (some code has not been included for the sake of brevity)
   
   ```rust
   // Create an output file for the manifest on the local filesystem.
   let manifest_file_path = NamedTempFile::new().unwrap();
   let manifest_file_output = FileIOBuilder::new_fs_io()
       .build()
       .unwrap()
       .new_output(manifest_file_path.path().to_str().unwrap())
       .unwrap();

   let manifest_writer = ManifestWriter::new(manifest_file_output, 0, Vec::new());

   // Writing the `Manifest` is the only way to obtain the `ManifestFile`.
   let manifest_file = manifest_writer
       .write(manifest)
       .await
       .unwrap();
       
   let manifest_list_path = NamedTempFile::new().unwrap();
   let manifest_list_output = FileIOBuilder::new_fs_io()
       .build()
       .unwrap()
       .new_output(manifest_list_path.path().to_str().unwrap())
       .unwrap();
   
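   let mut writer = ManifestListWriter::v2(manifest_list_output, 0, 0, 0);

   // `add_manifests` takes an iterator of `ManifestFile`s and returns a `Result`.
   writer.add_manifests(vec![manifest_file].into_iter()).unwrap();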
   
   writer.close().await.unwrap();
   ```
   - Building and serialization are coupled conceptually: in order to 'build' a `ManifestFile` you have to 'write' a `Manifest`.
   - There is a second coupling between building and serde: _where the metadata gets written_ is included in _what metadata is written_.
     - When you specify a location to write a `ManifestFile` to, that location is where the `ManifestFile` gets written _and is [included in the metadata](https://github.com/apache/iceberg-rust/blob/42aff04658a00b390122260dbbeaf512d11af61f/crates/iceberg/src/spec/manifest.rs#L305) of that `ManifestFile`_.
     - This means that when the built `ManifestFile` is added to a `ManifestList`, the location of the `ManifestFile` is what 'points' the `ManifestList` to that `ManifestFile`.
   - This coupling forces the user to write to their preferred storage layer through the `FileIO`/`OutputFile`/`InputFile` types, instead of allowing them to build and use their own abstractions for "where the bytes get written to".
     - We would really like to separate the building and serialization layers, as that would allow us to use our own storage layer abstractions.
     - For example, if the user wants to use their own storage layer for storing metadata bytes:
       - They must build/write all the necessary metadata types using `FileIO`.
       - They would then need to 'copy' all these bytes to their preferred storage layer.
       - :warning: **problem** :warning:
         - Because the metadata itself contains "where" the metadata is, it is no longer valid once it is "moved" somewhere else. The 'metadata hierarchy' (i.e. which metadata points to which snapshot, which points to which manifest list, etc.) is only valid for the location where it was built/serialized. To illustrate this:
         
         
   
![image](https://github.com/user-attachments/assets/1634c21e-8d65-430e-9452-f8061d902feb)
   
   In the above example, the `ManifestList` and `ManifestFile` were built/serialized on `Node B` and then copied over to `Node A`. Because the building/serialization was performed on `Node B`, the `ManifestList` on `Node A` still points to the `ManifestFile` on `Node B`.
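
To make the request concrete, here is a minimal sketch of what a decoupled design could look like. It does not use the iceberg-rust API: `ManifestEntry`, `Storage`, `MemStorage`, `build_manifest`, `build_manifest_list`, and `list_is_resolvable` are all hypothetical names, the "serialization" is a plain text format rather than Avro, and the `node-a://` / `node-b://` locations are made up. The only point is that the build step returns bytes and takes the final location as an input, so the caller decides where and how the bytes are stored.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for what a manifest list records about one manifest.
#[derive(Debug, Clone, PartialEq)]
struct ManifestEntry {
    manifest_path: String,
    manifest_length: u64,
}

/// Hypothetical user-owned storage abstraction ("where the bytes get written to").
trait Storage {
    fn put(&mut self, location: &str, bytes: Vec<u8>);
    fn get(&self, location: &str) -> Option<&[u8]>;
}

/// In-memory storage, used only so the sketch runs without real I/O.
#[derive(Default)]
struct MemStorage {
    objects: HashMap<String, Vec<u8>>,
}

impl Storage for MemStorage {
    fn put(&mut self, location: &str, bytes: Vec<u8>) {
        self.objects.insert(location.to_string(), bytes);
    }
    fn get(&self, location: &str) -> Option<&[u8]> {
        self.objects.get(location).map(|b| b.as_slice())
    }
}

/// Build step: returns the entry and the bytes. No I/O happens here; the
/// caller supplies the location the bytes will eventually live at.
fn build_manifest(final_location: &str, payload: &str) -> (ManifestEntry, Vec<u8>) {
    let bytes = payload.as_bytes().to_vec();
    let entry = ManifestEntry {
        manifest_path: final_location.to_string(),
        manifest_length: bytes.len() as u64,
    };
    (entry, bytes)
}

/// Build step for the list: one "path,length" line per entry (not Avro).
fn build_manifest_list(entries: &[ManifestEntry]) -> Vec<u8> {
    entries
        .iter()
        .map(|e| format!("{},{}\n", e.manifest_path, e.manifest_length))
        .collect::<String>()
        .into_bytes()
}

/// Returns true only if the list exists and every manifest it points to
/// can be found in the same storage.
fn list_is_resolvable(storage: &dyn Storage, list_location: &str) -> bool {
    let Some(bytes) = storage.get(list_location) else {
        return false;
    };
    let text = String::from_utf8_lossy(bytes);
    text.lines().all(|line| {
        let path = line.split(',').next().unwrap_or("");
        storage.get(path).is_some()
    })
}

fn main() {
    let mut node_a = MemStorage::default();

    // Coupled flow: metadata built for node B, then copied to node A.
    // The list on node A still names the node B path, so it cannot resolve.
    let (stale, stale_bytes) = build_manifest("node-b://meta/m1", "data-file-1");
    node_a.put("node-a://meta/m1", stale_bytes);
    node_a.put("node-a://meta/list-stale", build_manifest_list(&[stale]));
    assert!(!list_is_resolvable(&node_a, "node-a://meta/list-stale"));

    // Decoupled flow: build against the final location, then store the bytes
    // through the caller's own storage layer.
    let (entry, bytes) = build_manifest("node-a://meta/m2", "data-file-2");
    let list_bytes = build_manifest_list(std::slice::from_ref(&entry));
    node_a.put(&entry.manifest_path, bytes);
    node_a.put("node-a://meta/list", list_bytes);
    assert!(list_is_resolvable(&node_a, "node-a://meta/list"));
}
```

In this sketch the location is still part of the metadata, as it is today, but it becomes an explicit argument to the build step rather than a side effect of which `OutputFile` was used to write the bytes.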


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org