laskoviymishka commented on code in PR #1184:
URL: https://github.com/apache/iceberg-go/pull/1184#discussion_r3412827182
##########
table/writer.go:
##########
@@ -38,6 +38,13 @@ type WriteTask struct {
FileCount int // FileCount is a sequential counter for files written
by this task.
Schema *iceberg.Schema
Batches []arrow.RecordBatch
+ // SortOrderID, when non-zero, is recorded on the resulting file as its
+ // sort order. Set it only when Batches are fully sorted by that order —
+ // the writer does not verify or enforce the claim. Only the bin-packed
+ // task path (defaultDataFileWriter.writeFile) honors it; the rolling
+ // data writer builds its own file info and ignores it, and position
+ // delete content must leave it zero (the spec requires a null sort
+ // order id there).
SortOrderID int
Review Comment:
This is an exported field whose behavior depends on which internal writer
happens to process the task — honored by the bin-packed path, silently dropped
by the rolling path. The doc is honest about it, but "set this field and it may
or may not do anything depending on internal routing you can't see" is a pretty
fragile contract for a public type.
I'd either move `SortOrderID` onto a narrower task type that only the
bin-packed path constructs, or add a guard in
`openFileWriter`/`newRollingDataWriter` that errors (don't panic — see the
goroutine concern) on a non-zero claim it's about to ignore, mirroring the
pos-delete invariant. Right now a downstream caller that pre-sorts and sets
this through the rolling path believes the manifest is stamped when it isn't.
Thoughts on splitting the type?
##########
table/internal/utils.go:
##########
@@ -330,7 +337,18 @@ func (d *DataFileStatistics) ToDataFile(opts DataFileOpts)
iceberg.DataFile {
bldr.EqualityFieldIDs(d.EqualityFieldIDs)
}
- bldr.SortOrderID(opts.SortOrderID)
+ // Position deletes are ordered by (file_path, pos), never by a table
sort
+ // order; the spec requires their sort order id to stay null. This is an
+ // internal invariant, not user-input validation: no exported API
accepts
+ // a WriteTask or reaches this package, and the pos-delete producers
never
+ // set a claim — a non-zero value here is a library bug.
+ if opts.Content == iceberg.EntryContentPosDeletes && opts.SortOrderID
!= unsortedSortOrderID {
Review Comment:
I think the invariant is right but the `panic` is in a risky spot.
`ToDataFile` is reached through `ParquetFileWriter.Close()`, which runs in the
rolling writer's `stream` goroutine — and that goroutine has no `recover()`,
unlike `recordsToDataFiles` and `equalityDeleteRecordsToDataFiles`, which both
convert panics to errors on their error channel. So if this ever fires from
that path it takes down the process instead of surfacing an error.
It's unreachable from the public API today, agreed. But I'd rather not leave
a process-killing panic on a code path a future refactor could wander into. Two
options: validate the content/claim combination earlier on the calling
goroutine (in `writeFile` before `WriteDataFile`) where the content type is
already known, or have `ToDataFile` return `(DataFile, error)` and let callers
propagate it.
If we keep the panic, I'd at least add a `recover()` in `stream`'s top-level
defer that routes to `r.errorCh`, mirroring the sibling writers, and a comment
stating the panic is an unreachable last-resort guard. wdyt?
##########
table/write_records_sort_test.go:
##########
@@ -138,8 +138,9 @@ func (s *WriteRecordsSortTestSuite)
TestPartitionedTableSortedPerFile() {
}
for _, df := range dataFiles {
- s.NotNil(df.SortOrderID(), "data file must carry the sort order
id")
- s.Equal(meta.DefaultSortOrder(), *df.SortOrderID())
+ // The writer only sorts per batch, so it never claims the
table sort
+ // order on the file (the spec's sort_order_id is a per-file
claim).
+ s.Nil(df.SortOrderID(), "data file must not claim a sort order
id")
Review Comment:
While we're tightening sort assertions:
`TestUnpartitionedTableMultipleBatchesEachSorted` (further down in this file)
doesn't actually distinguish per-batch from global sort. Its expected sequence
`[1,2,3,4,5,6]` passes whether the writer sorts each batch independently or
sorts globally, because the batches happen to be monotonic — so the only real
regression guard there is the nil `sort_order_id` check, not the ordering.
I'd feed it non-monotonic, overlapping batches (e.g. `[5,3,1]` then
`[4,2,0]`) and assert the locally-sorted-but-not-global result `[1,3,5,0,2,4]`.
That fails the moment someone introduces a global sort, which is exactly the
behavior this whole PR is drawing a line under.
##########
table/rolling_data_writer_test.go:
##########
@@ -313,8 +313,8 @@ func (s *RollingDataWriterTestSuite)
TestWriterFactoryPropagatesSortOrderID() {
}
s.Require().Len(dataFiles, 1)
- s.Require().NotNil(dataFiles[0].SortOrderID(), "SortOrderID must be set
on the emitted DataFile")
- s.Equal(expectedSortOrderID, *dataFiles[0].SortOrderID())
+ s.Nil(dataFiles[0].SortOrderID(),
Review Comment:
This nicely pins that the rolling writer doesn't stamp the file. What's
still missing is a test for the other half of the `WriteTask.SortOrderID`
contract: a caller setting a non-zero claim and routing it through the path
that ignores it.
I'd add a case here that constructs a `WriteTask` with a non-zero
`SortOrderID`, runs it through the rolling/factory path on a sorted batch, and
asserts `DataFile.SortOrderID()` is still nil. That's the only thing that pins
"rolling writer ignores the claim" — otherwise a future change that starts
honoring it goes unnoticed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]