Re: [PR] feat(core): Add support for `_file` column [iceberg-rust]

via GitHub Sun, 09 Nov 2025 17:33:43 -0800


liurenjie1024 commented on code in PR #1824:
URL: https://github.com/apache/iceberg-rust/pull/1824#discussion_r2508478595



##########
crates/iceberg/src/scan/mod.rs:
##########
@@ -42,6 +45,28 @@ use crate::table::Table;
 use crate::utils::available_parallelism;
 use crate::{Error, ErrorKind, Result};
 
+/// Reserved column name for the file path metadata column.
+///
+/// When this column is selected in a table scan (e.g., `.select(["col1", 
RESERVED_COL_NAME_FILE])`),
+/// each row will include the path of the file from which that row was read.
+/// This is useful for debugging, auditing, or tracking data lineage.
+///
+/// # Example
+/// ```no_run
+/// # use iceberg::scan::RESERVED_COL_NAME_FILE;
+/// # use iceberg::table::Table;
+/// # async fn example() -> iceberg::Result<()> {
+/// # let table: Table = todo!();
+/// // Select regular columns along with the file path
+/// let scan = table
+///     .scan()
+///     .select(["id", "name", RESERVED_COL_NAME_FILE])
+///     .build()?;
+/// # Ok(())
+/// # }
+/// ```
+pub const RESERVED_COL_NAME_FILE: &str = RESERVED_COL_NAME_FILE_INTERNAL;

Review Comment:
   We will have more metadata columns, so I would prefert to put these 
definition in sth like `metadata_columns` module.



##########
crates/iceberg/src/scan/mod.rs:
##########
@@ -217,9 +242,13 @@ impl<'a> TableScanBuilder<'a> {
 
         let schema = snapshot.schema(self.table.metadata())?;
 
-        // Check that all column names exist in the schema.
+        // Check that all column names exist in the schema (skip reserved 
columns).
         if let Some(column_names) = self.column_names.as_ref() {
             for column_name in column_names {
+                // Skip reserved columns that don't exist in the schema
+                if column_name == RESERVED_COL_NAME_FILE_INTERNAL {

Review Comment:
   We should have sth like `is_metadata_column_name()` in `metadata_columns` 
module, and use`is_metadata_column_name` so that we could avoid such changes 
when we add more metadata columns.



##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -523,6 +568,93 @@ impl ArrowReader {
         Ok(results.into())
     }
 
+    /// Helper function to add a `_file` column to a RecordBatch at a specific 
position.
+    /// Takes the array, field to add, and position where to insert.
+    fn create_file_field_at_position(

Review Comment:
   I think this approach is not extensible. I prefer what's similar in [this 
pr](https://github.com/apache/iceberg-rust/pull/1821/files#diff-640415fbb0b37eac6f774eff524fb364c4e54c02da3a3c0e355af5caa941007c):
   1. Add `constant_map` for `ArrowReader`
   2. Add another variant of `RecordBatchTransformer` to handle constant field 
like `_file`
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(core): Add support for `_file` column [iceberg-rust]

Reply via email to