laskoviymishka commented on code in PR #1184:
URL: https://github.com/apache/iceberg-go/pull/1184#discussion_r3412827182


##########
table/writer.go:
##########
@@ -38,6 +38,13 @@ type WriteTask struct {
        FileCount   int // FileCount is a sequential counter for files written 
by this task.
        Schema      *iceberg.Schema
        Batches     []arrow.RecordBatch
+       // SortOrderID, when non-zero, is recorded on the resulting file as its
+       // sort order. Set it only when Batches are fully sorted by that order —
+       // the writer does not verify or enforce the claim. Only the bin-packed
+       // task path (defaultDataFileWriter.writeFile) honors it; the rolling
+       // data writer builds its own file info and ignores it, and position
+       // delete content must leave it zero (the spec requires a null sort
+       // order id there).
        SortOrderID int

Review Comment:
   This is an exported field whose behavior depends on which internal writer 
happens to process the task — honored by the bin-packed path, silently dropped 
by the rolling path. The doc is honest about it, but "set this field and it may 
or may not do anything depending on internal routing you can't see" is a pretty 
fragile contract for a public type.
   
   I'd either move `SortOrderID` onto a narrower task type that only the 
bin-packed path constructs, or add a guard in 
`openFileWriter`/`newRollingDataWriter` that errors (don't panic — see the 
goroutine concern) on a non-zero claim it's about to ignore, mirroring the 
pos-delete invariant. Right now a downstream caller that pre-sorts and sets 
this through the rolling path believes the manifest is stamped when it isn't. 
Thoughts on splitting the type?



##########
table/internal/utils.go:
##########
@@ -330,7 +337,18 @@ func (d *DataFileStatistics) ToDataFile(opts DataFileOpts) 
iceberg.DataFile {
                bldr.EqualityFieldIDs(d.EqualityFieldIDs)
        }
 
-       bldr.SortOrderID(opts.SortOrderID)
+       // Position deletes are ordered by (file_path, pos), never by a table 
sort
+       // order; the spec requires their sort order id to stay null. This is an
+       // internal invariant, not user-input validation: no exported API 
accepts
+       // a WriteTask or reaches this package, and the pos-delete producers 
never
+       // set a claim — a non-zero value here is a library bug.
+       if opts.Content == iceberg.EntryContentPosDeletes && opts.SortOrderID 
!= unsortedSortOrderID {

Review Comment:
   I think the invariant is right but the `panic` is in a risky spot. 
`ToDataFile` is reached through `ParquetFileWriter.Close()`, which runs in the 
rolling writer's `stream` goroutine — and that goroutine has no `recover()`, 
unlike `recordsToDataFiles` and `equalityDeleteRecordsToDataFiles`, which both 
convert panics to errors on their error channel. So if this ever fires from 
that path it takes down the process instead of surfacing an error.
   
   It's unreachable from the public API today, agreed. But I'd rather not leave 
a process-killing panic on a code path a future refactor could wander into. Two 
options: validate the content/claim combination earlier on the calling 
goroutine (in `writeFile` before `WriteDataFile`) where the content type is 
already known, or have `ToDataFile` return `(DataFile, error)` and let callers 
propagate it.
   
   If we keep the panic, I'd at least add a `recover()` in `stream`'s top-level 
defer that routes to `r.errorCh`, mirroring the sibling writers, and a comment 
stating the panic is an unreachable last-resort guard. wdyt?



##########
table/write_records_sort_test.go:
##########
@@ -138,8 +138,9 @@ func (s *WriteRecordsSortTestSuite) 
TestPartitionedTableSortedPerFile() {
        }
 
        for _, df := range dataFiles {
-               s.NotNil(df.SortOrderID(), "data file must carry the sort order 
id")
-               s.Equal(meta.DefaultSortOrder(), *df.SortOrderID())
+               // The writer only sorts per batch, so it never claims the 
table sort
+               // order on the file (the spec's sort_order_id is a per-file 
claim).
+               s.Nil(df.SortOrderID(), "data file must not claim a sort order 
id")

Review Comment:
   While we're tightening sort assertions: 
`TestUnpartitionedTableMultipleBatchesEachSorted` (further down in this file) 
doesn't actually distinguish per-batch from global sort. Its expected sequence 
`[1,2,3,4,5,6]` passes whether the writer sorts each batch independently or 
sorts globally, because the batches happen to be monotonic — so the only real 
regression guard there is the nil `sort_order_id` check, not the ordering.
   
   I'd feed it non-monotonic, overlapping batches (e.g. `[5,3,1]` then 
`[4,2,0]`) and assert the locally-sorted-but-not-global result `[1,3,5,0,2,4]`. 
That fails the moment someone introduces a global sort, which is exactly the 
behavior this whole PR is drawing a line under.



##########
table/rolling_data_writer_test.go:
##########
@@ -313,8 +313,8 @@ func (s *RollingDataWriterTestSuite) 
TestWriterFactoryPropagatesSortOrderID() {
        }
 
        s.Require().Len(dataFiles, 1)
-       s.Require().NotNil(dataFiles[0].SortOrderID(), "SortOrderID must be set 
on the emitted DataFile")
-       s.Equal(expectedSortOrderID, *dataFiles[0].SortOrderID())
+       s.Nil(dataFiles[0].SortOrderID(),

Review Comment:
   This nicely pins that the rolling writer doesn't stamp the file. What's 
still missing is a test for the other half of the `WriteTask.SortOrderID` 
contract: a caller setting a non-zero claim and routing it through the path 
that ignores it.
   
   I'd add a case here that constructs a `WriteTask` with a non-zero 
`SortOrderID`, runs it through the rolling/factory path on a sorted batch, and 
asserts `DataFile.SortOrderID()` is still nil. That's the only thing that pins 
"rolling writer ignores the claim" — otherwise a future change that starts 
honoring it goes unnoticed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to