Re: [PR] feat: Extract FileRead and FileWrite trait [iceberg-rust]

via GitHub Thu, 09 May 2024 23:57:15 -0700


sdd commented on code in PR #364:
URL: https://github.com/apache/iceberg-rust/pull/364#discussion_r1596342297



##########
crates/iceberg/src/io.rs:
##########
@@ -206,6 +205,35 @@ impl FileIO {
     }
 }
 
+/// The struct that represents the metadata of a file.
+///
+/// TODO: we can add last modified time, content type, etc. in the future.
+pub struct FileMetadata {
+    /// The size of the file.
+    pub size: u64,
+}
+
+/// Trait for reading file.
+///
+/// # TODO
+///
+/// It's possible for us to remove the async_trait, but we need to figure
+/// out how to handle the object safety.
+#[async_trait::async_trait]
+pub trait FileRead: Send + Unpin + 'static {

Review Comment:
   Maybe we could call this `IcebergFileRead` or `IcebergRead`? That is 
arguably redundant, since the `use` statement that imports it already shows 
that it is ours, but there are a lot of `Read` and `Write` traits dotted 
around, and a distinctive name would make it easy to see at a glance which one 
is ours wherever it is used in the future.



##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -187,3 +197,43 @@ impl ArrowReader {
         }
     }
 }
+
+/// ArrowFileReader is a wrapper around a FileRead that impls parquet's AsyncFileReader.
+///
+/// # TODO
+///
+/// [ParquetObjectReader](https://docs.rs/parquet/latest/src/parquet/arrow/async_reader/store.rs.html#64) contains the following hints to speed up metadata loading, we can consider adding them to this struct:
+///
+/// - `metadata_size_hint`: Provide a hint as to the size of the parquet file's footer.
+/// - `preload_column_index`: Load the Column Index as part of [`Self::get_metadata`].
+/// - `preload_offset_index`: Load the Offset Index as part of [`Self::get_metadata`].
+struct ArrowFileReader<R: FileRead> {
+    meta: FileMetadata,
+    r: R,
+}
+
+impl<R: FileRead> ArrowFileReader<R> {
+    /// Create a new ArrowFileReader
+    fn new(meta: FileMetadata, r: R) -> Self {
+        Self { meta, r }
+    }
+}
+
+impl<R: FileRead> AsyncFileReader for ArrowFileReader<R> {
+    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, parquet::errors::Result<Bytes>> {
+        Box::pin(
+            self.r
+                .read(range.start as _..range.end as _)

Review Comment:
   This `range.start as _..range.end as _` looks a bit strange. Out of 
interest, why do we need the cast?
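
For context, a minimal sketch of what the cast does. It assumes that `read`
takes a `Range<u64>` while parquet's `get_bytes` hands over a `Range<usize>`;
the signature of `read` is not shown in this hunk, so that is an assumption.
The helper names `to_u64_range` and `to_u64_range_checked` are hypothetical and
exist only for illustration.

```rust
use std::ops::Range;

/// Hypothetical helper: widen a `Range<usize>` into a `Range<u64>`.
/// `as _` asks the compiler to infer the target type (here `u64`,
/// taken from the return type), which is what the reviewed line relies on.
fn to_u64_range(range: Range<usize>) -> Range<u64> {
    range.start as _..range.end as _
}

/// Hypothetical alternative: the same conversion spelled out with checked
/// conversions, returning `None` if a bound does not fit into `u64`.
fn to_u64_range_checked(range: Range<usize>) -> Option<Range<u64>> {
    let start = u64::try_from(range.start).ok()?;
    let end = u64::try_from(range.end).ok()?;
    Some(start..end)
}
```

Spelling out the target type, or using `try_from`, makes the intent of the
conversion visible at the call site, at the cost of a few more lines.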



##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -91,12 +98,15 @@ impl ArrowReader {
 
         Ok(try_stream! {
             while let Some(Ok(task)) = tasks.next().await {
-                let parquet_reader = file_io
-                    .new_input(task.data().data_file().file_path())?
+                let parquet_file = file_io
+                    .new_input(task.data().data_file().file_path())?;
+                let parquet_metadata = parquet_file.metadata().await?;
+                let parquet_reader = parquet_file

Review Comment:
   Can we use `futures::try_join!` here to run both of these futures 
concurrently rather than sequentially?
   
   ```rust
   let (parquet_metadata, parquet_reader) = try_join!(parquet_file.metadata(), parquet_file.reader())?;
   ```
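
To illustrate the idea behind the suggestion, here is a std-only sketch of the
polling pattern a try-join follows: both futures are polled from the same task,
the first error short-circuits, and the pair of values is returned once both
have finished. This is not the `futures` crate's implementation. All names
(`try_join2`, `block_on`, `NoopWake`, `metadata`, `reader`, `broken_reader`)
are hypothetical stand-ins; in the PR itself the `try_join!` macro would be
used directly.

```rust
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Drive two fallible futures from one task: poll both on every wake-up,
/// return the first error, or both values once each has finished.
async fn try_join2<A, B, T, U, E>(a: A, b: B) -> Result<(T, U), E>
where
    A: Future<Output = Result<T, E>>,
    B: Future<Output = Result<U, E>>,
{
    let mut a = pin!(a);
    let mut b = pin!(b);
    let mut a_out = None;
    let mut b_out = None;
    poll_fn(|cx| {
        // Only poll a future that has not completed yet.
        if a_out.is_none() {
            match a.as_mut().poll(cx) {
                Poll::Ready(Ok(v)) => a_out = Some(v),
                Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
                Poll::Pending => {}
            }
        }
        if b_out.is_none() {
            match b.as_mut().poll(cx) {
                Poll::Ready(Ok(v)) => b_out = Some(v),
                Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
                Poll::Pending => {}
            }
        }
        match (a_out.take(), b_out.take()) {
            (Some(x), Some(y)) => Poll::Ready(Ok((x, y))),
            // Put back whichever half is done and wait for the other.
            (x, y) => {
                a_out = x;
                b_out = y;
                Poll::Pending
            }
        }
    })
    .await
}

/// Waker that does nothing; sufficient for the busy-polling loop below.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling executor, for illustration only.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Hypothetical stand-ins for `parquet_file.metadata()` and `parquet_file.reader()`.
async fn metadata() -> Result<u64, String> {
    Ok(1024)
}
async fn reader() -> Result<&'static str, String> {
    Ok("reader")
}
async fn broken_reader() -> Result<&'static str, String> {
    Err("open failed".to_string())
}
```

With real I/O, the benefit is that the metadata request and the reader setup
overlap instead of the second waiting for the first to finish.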


