RussellSpitzer opened a new pull request, #13947: URL: https://github.com/apache/iceberg/pull/13947
One of our slowest test suites is TestRewrite*, and each version we add to test increases that burden. So I decided to make some quick changes to reduce it. I noticed that most of our slowdown comes from collecting the data to check for integrity. We do so with a Spark-side sort, which is extremely expensive because of the number of tasks we are reading. We have a lot of tasks because we forcibly change the split size of the table to trigger different file combinations in the suite, but this behavior isn't important when checking for integrity. I ran the following experiments.

----

## Original Performance

<img width="661" height="144" alt="Pasted Graphic" src="https://github.com/user-attachments/assets/6b16c5e9-f0ec-4fac-9bd2-e6e870fbfc78" />

## Using Single Partition Sort

```java
spark.read().format("iceberg").load(tableLocation).coalesce(1).sort("c1", "c2", "c3").collectAsList();
```

<img width="661" height="144" alt="Pasted Graphic 1" src="https://github.com/user-attachments/assets/fda70f34-dcb5-43e1-b627-e1e35b8e88b5" />

## Using a local sort instead of Spark sort

```java
// Collect unsorted, then sort on the driver instead of in Spark
List<Row> rows = spark.read().format("iceberg").load(tableLocation).collectAsList();
rows.sort(
    Comparator.comparingInt((Row r) -> r.getAs("c1"))
        .thenComparing(r -> r.getAs("c2"))
        .thenComparing(r -> r.getAs("c3")));
return rowsToJava(rows);
```

<img width="661" height="144" alt="Pasted Graphic 2" src="https://github.com/user-attachments/assets/23e776cd-f7e5-470a-b669-acd5509f2e20" />

## Minimizing splits used to read

```java
protected List<Object[]> currentData() {
  return rowsToJava(
      spark
          .read()
          .option(SparkReadOptions.SPLIT_SIZE, 1024 * 1024 * 32)
          .option(SparkReadOptions.FILE_OPEN_COST, 0)
          .format("iceberg")
          .load(tableLocation)
          .coalesce(1)
          .sort("c1", "c2", "c3")
          .collectAsList());
}
```

------

## Final Suite Timings

### Before

<img width="636" height="66" alt="Pasted Graphic 5"
src="https://github.com/user-attachments/assets/25138b12-2143-4e8d-b928-9d55e155603a" />

### After

<img width="636" height="66" alt="Pasted Graphic 4" src="https://github.com/user-attachments/assets/34cfcd0a-3ec4-40a9-a28a-dceb4399b73e" />

### Before

<img width="636" height="66" alt="Pasted Graphic 7" src="https://github.com/user-attachments/assets/27e7bc3d-bf8f-453e-8056-197f4a1c4b3d" />

### After

<img width="636" height="66" alt="Pasted Graphic 6" src="https://github.com/user-attachments/assets/ca9b3a4b-3e95-4f03-9395-a94b60bc7fe3" />

------

The improvements to the delete suite aren't as good, but I figured I'd make the same changes there as well.
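
For readers who want to try the local-sort idea without a Spark session, here is a minimal, self-contained sketch. It is not the code in this PR: the class `LocalSortSketch`, the helper `sortRows`, and the column types (an integer `c1` followed by string `c2` and `c3`) are assumptions made for illustration. Rows are modeled as plain `Object[]` arrays, roughly what a helper like `rowsToJava` would hand back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LocalSortSketch {

  // Hypothetical helper: sorts already-collected rows on the driver by their
  // first three columns, which stand in for c1 (assumed int), c2 and c3
  // (assumed strings). No Spark shuffle or extra tasks are involved.
  static List<Object[]> sortRows(List<Object[]> rows) {
    List<Object[]> sorted = new ArrayList<>(rows); // copy so the input is left untouched
    sorted.sort(
        Comparator.comparingInt((Object[] r) -> (Integer) r[0])
            .thenComparing(r -> (String) r[1])
            .thenComparing(r -> (String) r[2]));
    return sorted;
  }

  public static void main(String[] args) {
    List<Object[]> rows =
        Arrays.asList(
            new Object[] {2, "a", "x"},
            new Object[] {1, "b", "y"},
            new Object[] {1, "a", "z"});
    for (Object[] row : sortRows(rows)) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Sorting on the driver like this only makes sense when the collected data comfortably fits in memory, which is typically the case for small test tables.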
