aokolnychyi commented on code in PR #7692:
URL: https://github.com/apache/iceberg/pull/7692#discussion_r1204814401
##########
spark/v3.4/spark/src/jmh/java/org/apache/iceberg/spark/source/WritersBenchmark.java:
##########
@@ -363,6 +389,60 @@ public void writeUnpartitionedClusteredPositionDeleteWriter(Blackhole blackhole)
     blackhole.consume(writer);
   }
+
+  @Benchmark
+  @Threads(1)
+  public void writeUnpartitionedFanoutPositionDeleteWriter(Blackhole blackhole) throws IOException {
+    FileIO io = table().io();
+
+    OutputFileFactory fileFactory = newFileFactory();
+    SparkFileWriterFactory writerFactory =
+        SparkFileWriterFactory.builderFor(table()).dataFileFormat(fileFormat()).build();
+
+    FanoutPositionOnlyDeleteWriter<InternalRow> writer =
+        new FanoutPositionOnlyDeleteWriter<>(
+            writerFactory, fileFactory, io, TARGET_FILE_SIZE_IN_BYTES);
+
+    PositionDelete<InternalRow> positionDelete = PositionDelete.create();
+    try (FanoutPositionOnlyDeleteWriter<InternalRow> closeableWriter = writer) {
+      for (InternalRow row : positionDeleteRows) {
+        String path = row.getString(0);
+        long pos = row.getLong(1);
+        positionDelete.set(path, pos, null);
+        closeableWriter.write(positionDelete, unpartitionedSpec, null);
+      }
+    }
+
+    blackhole.consume(writer);
+  }
+
+  @Benchmark
+  @Threads(1)
+  public void writeUnpartitionedFanoutPositionDeleteWriterShuffled(Blackhole blackhole)
Review Comment:
I ran this benchmark (100 data files, 50k deletes each, 5 million deletes total) with a GC profiler and did not see anything bad. Issues would arise only with lots of unique data files. That's unlikely, as we distribute by partition, and this writer will still be disabled by default, so users will have to opt in explicitly. It isn't perfect, but there are reasonable use cases for it.
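
For reference, attaching a GC profiler to a JMH run like the one described above can be sketched with the standard JMH runner API (a minimal sketch, assuming JMH is on the classpath; the benchmark regex and option values here are illustrative, not the exact command used for the numbers quoted in the comment):

```java
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class FanoutDeleteWriterBenchmarkRunner {
  public static void main(String[] args) throws RunnerException {
    Options opts =
        new OptionsBuilder()
            // Match the fanout position-delete benchmarks from the diff above.
            .include("WritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriter.*")
            // GCProfiler reports allocation rate and GC counts per benchmark,
            // which is how "anything bad" in allocation behavior would show up.
            .addProfiler(GCProfiler.class)
            .forks(1) // illustrative; real runs may use more forks/iterations
            .build();

    new Runner(opts).run();
  }
}
```

Equivalently, `-prof gc` can be passed on the JMH command line when the benchmarks are run through the build's jmh task.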
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]