amogh-jahagirdar commented on code in PR #13555:
URL: https://github.com/apache/iceberg/pull/13555#discussion_r2219496449
##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java:
##########
@@ -163,6 +165,14 @@ private Spark3Util.CatalogAndIdentifier catalogAndIdentifier(CaseInsensitiveStri
       selector = TAG_PREFIX + tag;
     }
 
+    String groupId =
+        options.getOrDefault(
+            SparkReadOptions.SCAN_TASK_SET_ID,
+            options.get(SparkWriteOptions.REWRITTEN_FILE_SCAN_TASK_SET_ID));
+    if (groupId != null) {
+      selector = REWRITE_PREFIX + groupId.replace("-", "");

Review Comment:
   I think either is fine, since we just match against the `rewrite_` in the selector metadata. I just replaced the hyphens in the UUID with an empty string so it's a bit shorter while still keeping the uniqueness, which is useful for debugging.

##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -944,16 +983,15 @@ public void testBinPackCombineMixedFiles() {
     shouldHaveFiles(table, 3);
 
     List<Object[]> expectedRecords = currentData();
-
     int targetSize = averageFileSize(table);
-
     long dataSizeBefore = testDataSize(table);
 
     Result result =
         basicRewrite(table)
             .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Integer.toString(targetSize + 1000))
             .option(
                 SizeBasedFileRewritePlanner.MAX_FILE_SIZE_BYTES,
-                Integer.toString(targetSize + 80000))
+                // Increase max file size for V3 to account for additional row lineage fields
+                Integer.toString(targetSize + (formatVersion >= 3 ? 1850000 : 80000)))

Review Comment:
   OK, I do think for now it's probably fine to bump these up in the tests, since the additional lineage fields cause files to roll over earlier. I did some more testing with Parquet V2 and delta encoding, and that looks pretty promising in terms of fitting more records with lineage into a given file (the row IDs that are carried over, at least in this test, tend to increase sequentially within a file).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
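
For reference, a minimal sketch of the selector construction described in the first review comment above. It assumes `REWRITE_PREFIX` is the string `rewrite_` (the comment only says the match is on `rewrite_`; the constant's definition is not shown in the hunk). The class name `RewriteSelectorSketch` and the helpers `selectorFor` and `isRewriteSelector` are hypothetical and not part of Iceberg.

```java
import java.util.UUID;

// Illustrative only: mirrors the `REWRITE_PREFIX + groupId.replace("-", "")` line from the diff.
final class RewriteSelectorSketch {
  // Assumption: the real constant in IcebergSource may be defined differently.
  static final String REWRITE_PREFIX = "rewrite_";

  // Builds a selector from a group ID by dropping the hyphens of its UUID string form.
  static String selectorFor(String groupId) {
    return REWRITE_PREFIX + groupId.replace("-", "");
  }

  // Matching only needs the prefix, so the exact suffix format does not matter here.
  static boolean isRewriteSelector(String selector) {
    return selector != null && selector.startsWith(REWRITE_PREFIX);
  }

  public static void main(String[] args) {
    String groupId = UUID.randomUUID().toString(); // 36 characters, 4 of them hyphens
    String selector = selectorFor(groupId);        // prefix plus 32 hex characters
    System.out.println(groupId + " -> " + selector);
    System.out.println("is rewrite selector: " + isRewriteSelector(selector));
  }
}
```

Because the hyphens in a canonical UUID string always sit at the same positions, removing them should not make two distinct group IDs collide; it only shortens the selector by four characters.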