Re: [PR] fix(table): do not claim sort_order_id on files the writer did not fully sort [iceberg-go]

via GitHub Thu, 11 Jun 2026 04:12:13 -0700


laskoviymishka commented on code in PR #1184:
URL: https://github.com/apache/iceberg-go/pull/1184#discussion_r3395349624



##########
table/internal/utils.go:
##########
@@ -330,7 +337,11 @@ func (d *DataFileStatistics) ToDataFile(opts DataFileOpts) 
iceberg.DataFile {
                bldr.EqualityFieldIDs(d.EqualityFieldIDs)
        }
 
-       bldr.SortOrderID(opts.SortOrderID)
+       // Position deletes are ordered by (file_path, pos), never by a table 
sort
+       // order; the spec requires their sort order id to stay null.
+       if opts.SortOrderID != unsortedSortOrderID && opts.Content != 
iceberg.EntryContentPosDeletes {

Review Comment:
   This one `&&` is folding two invariants that want different handling. The 
`!= unsortedSortOrderID` arm is "caller made no claim" — fine to skip silently. 
But `Content == EntryContentPosDeletes && SortOrderID != 0` is a spec 
violation, and right now it gets silently swallowed at the lowest layer rather 
than surfaced.
   
   I'd split it into two sequential guards: first, if the content is 
pos-delete, skip stamping (and ideally assert `SortOrderID == 0`, since a 
non-zero claim here means a bug upstream); then, separately, if the order is 
non-unsorted, stamp it. That way the prohibited-claim case is detected instead 
of masked. wdyt?



##########
table/rolling_data_writer_test.go:
##########
@@ -313,8 +313,8 @@ func (s *RollingDataWriterTestSuite) 
TestWriterFactoryPropagatesSortOrderID() {
        }
 
        s.Require().Len(dataFiles, 1)
-       s.Require().NotNil(dataFiles[0].SortOrderID(), "SortOrderID must be set 
on the emitted DataFile")
-       s.Equal(expectedSortOrderID, *dataFiles[0].SortOrderID())
+       s.Nil(dataFiles[0].SortOrderID(),

Review Comment:
   These inversions correctly assert the defaults don't stamp, but nothing 
exercises the explicit-claim path the other direction. Nothing constructs a 
`WriteTask` with a non-zero `SortOrderID`, runs it through 
`defaultDataFileWriter.writeFile`, and asserts `DataFile.SortOrderID()` equals 
the input.
   
   That's the one path where the field is actually load-bearing, and it's 
currently untested — if the `writeFile` wiring broke, no test would catch it. 
I'd add that integration test before merge.



##########
table/writer.go:
##########
@@ -38,6 +38,9 @@ type WriteTask struct {
        FileCount   int // FileCount is a sequential counter for files written 
by this task.
        Schema      *iceberg.Schema
        Batches     []arrow.RecordBatch
+       // SortOrderID, when non-zero, is recorded on the resulting file as its

Review Comment:
   This comment overpromises. It reads as unconditional, but it's not: the 
field is honored on the `defaultDataFileWriter.writeFile` path and completely 
ignored on the `writerFactory`/`RollingDataWriter` path (`openFileWriter` 
builds its own `WriteFileInfo` and never reads it), and it's also suppressed 
for pos-delete content by the `ToDataFile` guard.
   
   A caller going through `WriteRecords` who sets this field gets a silent 
no-op. I'd qualify the doc to say it only takes effect on the batch/`writeFile` 
path and that pos-delete content drops it per spec, so the next person doesn't 
wire it up expecting it to work everywhere. Fix the doc and this is good to 
land.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(table): do not claim sort_order_id on files the writer did not fully sort [iceberg-go]

Reply via email to