mbutrovich commented on issue #2607: URL: https://github.com/apache/iceberg-rust/issues/2607#issuecomment-4658895206
> ## 2. `_pos` metadata column
>
> **Description:** The ordinal row position (0-based) within the source data file. Unlike `_file` and `_spec_id`, this is NOT a constant -- it increases monotonically across batches within a file.
>
> **Changes:**
>
> * New `ColumnSource` variant in `RecordBatchTransformer` (e.g., `RowPosition`) that generates sequential `Int64Array` values
> * Mutable state in the transformer tracking the row offset across batches within a file. After each batch of N rows, `start_offset += N`.
> * Handle split reads: if `FileScanTask` reads a portion of a file, the initial position offset must account for rows before the split (from Parquet row group's row index offset)
> * In `pipeline.rs`, detect `RESERVED_FIELD_ID_POS` in projected fields and configure the transformer accordingly
> * Must use the same 0-based numbering semantics as positional delete files
>
> **Design considerations:**
>
> * iceberg-rust currently handles positional deletes via a separate `DeleteVector`/`RowSelection` pre-filtering mechanism. The `_pos` column is architecturally independent but must agree on numbering.
> * In Java, `PositionVectorReader` gets `rowStart` from `PageReadStore.getRowIndexOffset()` per row group, then fills `[rowStart, rowStart+1, ..., rowStart+N-1]` per batch.

Is this semantically different than the `row_number` virtual column that arrow-rs already supports that I recently threaded through DataFusion? https://github.com/apache/datafusion/pull/22026
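For reference, the offset tracking described in the quoted proposal could look roughly like the sketch below. This is only an illustration, not iceberg-rust code: `RowPositionGenerator`, `first_row`, and `next_batch` are hypothetical names, and a plain `Vec<i64>` stands in for the `Int64Array` the proposal mentions so the snippet has no dependencies. It also assumes each batch holds contiguous rows from the file, so it would not be correct as-is if rows are skipped inside a row group (for example by a `RowSelection`).

```rust
/// Hypothetical helper (not an iceberg-rust API): tracks the next 0-based
/// row position within a single data file across successive batches.
struct RowPositionGenerator {
    /// File-level position of the first row of the next batch.
    next_pos: i64,
}

impl RowPositionGenerator {
    /// `first_row` is the file-level index of the first row that will be
    /// read: 0 for a whole-file read, or the row group's row index offset
    /// when the task covers only a split of the file.
    fn new(first_row: i64) -> Self {
        Self { next_pos: first_row }
    }

    /// Returns `[start, start + 1, ..., start + num_rows - 1]` and advances
    /// the stored offset by `num_rows`.
    /// Assumption: the batch's rows are contiguous in the source file.
    fn next_batch(&mut self, num_rows: usize) -> Vec<i64> {
        let start = self.next_pos;
        let end = start + num_rows as i64;
        self.next_pos = end;
        (start..end).collect()
    }
}
```

With this sketch, a generator created with `RowPositionGenerator::new(0)` yields `[0, 1, 2]` for a first batch of three rows and `[3, 4]` for a following batch of two, while one created with `new(1000)` for a split starts at 1000. The generator would need to be reset (or re-created) whenever the reader moves to a new file or to a non-adjacent row group.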
