mbutrovich commented on issue #2607:
URL: https://github.com/apache/iceberg-rust/issues/2607#issuecomment-4658895206

   > ## 2. `_pos` metadata column
   > 
   > **Description:** The ordinal row position (0-based) within the source data file. Unlike `_file` and `_spec_id`, this is NOT a constant -- it increases monotonically across batches within a file.
   > 
   > **Changes:**
   > 
   > * New `ColumnSource` variant in `RecordBatchTransformer` (e.g., `RowPosition`) that generates sequential `Int64Array` values
   > * Mutable state in the transformer tracking the row offset across batches within a file. After each batch of N rows, `start_offset += N`.
   > * Handle split reads: if `FileScanTask` reads a portion of a file, the initial position offset must account for rows before the split (from Parquet row group's row index offset)
   > * In `pipeline.rs`, detect `RESERVED_FIELD_ID_POS` in projected fields and configure the transformer accordingly
   > * Must use the same 0-based numbering semantics as positional delete files
   > 
   > **Design considerations:**
   > 
   > * iceberg-rust currently handles positional deletes via a separate `DeleteVector`/`RowSelection` pre-filtering mechanism. The `_pos` column is architecturally independent but must agree on numbering.
   > * In Java, `PositionVectorReader` gets `rowStart` from `PageReadStore.getRowIndexOffset()` per row group, then fills `[rowStart, rowStart+1, ..., rowStart+N-1]` per batch.
   
   Is this semantically different from the `row_number` virtual column that arrow-rs already supports, which I recently threaded through DataFusion? https://github.com/apache/datafusion/pull/22026
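
For comparison, the `_pos` numbering described in the quoted proposal can be sketched roughly as below. This is only an illustration under stated assumptions, not iceberg-rust or arrow-rs code: `RowPositionGenerator` and `row_group_first_row_indexes` are hypothetical names, a plain `Vec<i64>` stands in for `Int64Array`, and the reader is assumed to scan a contiguous run of row groups with no row filtering applied before positions are assigned.

```rust
/// Hypothetical helper (not an iceberg-rust type): produces file-level,
/// 0-based row positions for successive batches read from one data file.
#[derive(Debug)]
pub struct RowPositionGenerator {
    /// File-level position of the next row to be emitted.
    next_pos: i64,
}

impl RowPositionGenerator {
    /// `first_row_index` is the file-level position of the first row this
    /// reader will see: 0 for a whole-file read, or the first row index of
    /// the first selected row group for a split read.
    pub fn new(first_row_index: i64) -> Self {
        Self { next_pos: first_row_index }
    }

    /// Returns the positions for a batch of `num_rows` rows and advances the
    /// running offset, mirroring `start_offset += N` in the proposal.
    pub fn next_batch(&mut self, num_rows: usize) -> Vec<i64> {
        let start = self.next_pos;
        let end = start + num_rows as i64;
        self.next_pos = end;
        (start..end).collect()
    }
}

/// Derives the first row index of each row group from per-row-group row
/// counts (assumed to come from the Parquet footer metadata).
pub fn row_group_first_row_indexes(row_counts: &[i64]) -> Vec<i64> {
    let mut acc = 0i64;
    row_counts
        .iter()
        .map(|n| {
            let start = acc;
            acc += *n;
            start
        })
        .collect()
}
```

A split that begins at the second row group of a file whose row groups hold 100, 50 and 25 rows would construct the generator with `RowPositionGenerator::new(100)`, so its first batch starts at position 100 rather than 0. If row groups are skipped mid-scan, the offset would need to be reset at each row group boundary instead of only incremented.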


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

