sdd commented on code in PR #364:
URL: https://github.com/apache/iceberg-rust/pull/364#discussion_r1596342297


##########
crates/iceberg/src/io.rs:
##########
@@ -206,6 +205,35 @@ impl FileIO {
     }
 }
 
+/// The struct the represents the metadata of a file.
+///
+/// TODO: we can add last modified time, content type, etc. in the future.
+pub struct FileMetadata {
+    /// The size of the file.
+    pub size: u64,
+}
+
+/// Trait for reading file.
+///
+/// # TODO
+///
+/// It's possible for us to remove the async_trait, but we need to figure
+/// out how to handle the object safety.
+#[async_trait::async_trait]
+pub trait FileRead: Send + Unpin + 'static {

Review Comment:
   Maybe we could call this `IcebergFileRead` or `IcebergRead`? I suppose that's a bit redundant, since anyone can see that it is ours by navigating to the `use` statement where it gets imported. Still, I'm conscious that there are a lot of `Read` and `Write` traits dotted around, and a distinct name would make it easier to see at a glance which one is ours wherever it is used in the future.



##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -187,3 +197,43 @@ impl ArrowReader {
         }
     }
 }
+
+/// ArrowFileReader is a wrapper around a FileRead that impls parquets AsyncFileReader.
+///
+/// # TODO
+///
+/// [ParquetObjectReader](https://docs.rs/parquet/latest/src/parquet/arrow/async_reader/store.rs.html#64) contains the following hints to speed up metadata loading, we can consider adding them to this struct:
+///
+/// - `metadata_size_hint`: Provide a hint as to the size of the parquet file's footer.
+/// - `preload_column_index`: Load the Column Index as part of [`Self::get_metadata`].
+/// - `preload_offset_index`: Load the Offset Index as part of [`Self::get_metadata`].
+struct ArrowFileReader<R: FileRead> {
+    meta: FileMetadata,
+    r: R,
+}
+
+impl<R: FileRead> ArrowFileReader<R> {
+    /// Create a new ArrowFileReader
+    fn new(meta: FileMetadata, r: R) -> Self {
+        Self { meta, r }
+    }
+}
+
+impl<R: FileRead> AsyncFileReader for ArrowFileReader<R> {
+    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, parquet::errors::Result<Bytes>> {
+        Box::pin(
+            self.r
+                .read(range.start as _..range.end as _)

Review Comment:
   This `range.start as _..range.end as _` looks a bit strange. Out of interest, why do we need the cast?



##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -91,12 +98,15 @@ impl ArrowReader {
 
         Ok(try_stream! {
             while let Some(Ok(task)) = tasks.next().await {
-                let parquet_reader = file_io
-                    .new_input(task.data().data_file().file_path())?
+                let parquet_file = file_io
+                    .new_input(task.data().data_file().file_path())?;
+                let parquet_metadata = parquet_file.metadata().await?;
+                let parquet_reader =parquet_file

Review Comment:
   Can we use `futures::try_join` here to run both of these futures concurrently rather than sequentially?
   
   ```rust
   let (parquet_metadata, parquet_reader) = try_join!(parquet_file.metadata(), parquet_file.reader())?;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
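
A note on the `range.start as _..range.end as _` cast questioned above. This is only a minimal sketch of one likely reason, assuming that `FileRead::read` takes a `Range<u64>` (its signature is not shown in the quoted diff), while parquet's `get_bytes` receives a `Range<usize>` (as the diff does show). The standard library provides no conversion between `Range<usize>` and `Range<u64>`, so each endpoint has to be cast on its own, and `as _` lets the compiler infer the target type from the callee. `read_len` and `to_u64_range` are hypothetical names used for illustration only.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a reader API that addresses bytes with u64 offsets.
/// Assumption: the real `FileRead::read` has a similar `Range<u64>` parameter.
fn read_len(range: Range<u64>) -> u64 {
    range.end - range.start
}

/// Convert a `Range<usize>` into a `Range<u64>` one endpoint at a time.
fn to_u64_range(range: Range<usize>) -> Range<u64> {
    // Writing `as u64` does the same thing as `as _`, but names the type.
    range.start as u64..range.end as u64
}

fn main() {
    let parquet_range: Range<usize> = 4..12;

    // Inferred form, as in the diff: the parameter type of `read_len`
    // fixes `_` to u64 for both endpoints.
    let inferred = read_len(parquet_range.start as _..parquet_range.end as _);

    // Explicit form via the helper.
    let explicit = read_len(to_u64_range(parquet_range));

    assert_eq!(inferred, explicit);
    println!("bytes requested: {}", inferred);
}
```

If a checked conversion is preferred over `as`, `u64::try_from(range.start)` is the standard-library alternative, at the cost of handling an error that cannot occur on today's 32-bit and 64-bit targets.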