pseudomuto opened a new issue, #10110: URL: https://github.com/apache/iceberg/issues/10110
### Apache Iceberg version

1.4.3

### Query engine

Spark

### Please describe the bug 🐞

I'm having trouble running the RewriteDataFiles action in Spark. I have a table with ~60B records partitioned by domain and day. When I run the job, all the stages complete successfully and appear to have rewritten the data, but the final stage fails with an NPE inside Nessie's `JavaHttpClient`, which results in no commit being made to the table. I have tried enabling partial progress and tweaking the max commits, but I'm not able to commit successfully regardless of those settings (at least with the combinations I've tried). Oddly enough, the import jobs, which also use Spark and Nessie with the same properties, are working without issue. I'm wondering if this is a bug, or if anyone can shed some light on what might be going on here.

**Code**

```java
// "spark" is an existing context with the properties (below) configured
SparkActions.get(spark)
    .rewriteDataFiles(table)
    .option(RewriteDataFilesSparkAction.MAX_CONCURRENT_FILE_GROUP_REWRITES, "1000")
    .option(RewriteDataFilesSparkAction.TARGET_FILE_SIZE_BYTES, "536870912")
    .filter(Expressions.equal(Expressions.day("occurred_at"), 19804))
    .execute();
```

**Spark Properties**

```
spark.sql.catalog.nessie.cache-enabled=false
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.client-api-version=2
spark.sql.catalog.nessie.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.uri=https://<domain>/api/v2
spark.sql.catalog.nessie.warehouse=gs://<bucket>/<dir>
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.defaultCatalog=nessie
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions
spark.sql.catalog.nessie.authentication.type=BEARER
spark.sql.catalog.nessie.quarkus.oidc.auth-server-url=https://accounts.google.com
spark.sql.catalog.nessie.quarkus.oidc.client-id=<google_client_id>
spark.hadoop.parquet.enable.summary-metadata=false
spark.sql.parquet.mergeSchema=false
spark.sql.parquet.filterPushdown=true
spark.sql.sources.partitionOverwriteMode=dynamic
spark.sql.hive.metastorePartitionPruning=true
spark.sql.files.maxPartitionBytes=1073741824
spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
spark.hadoop.fs.gs.auth.service.account.enable=true
```

**Logs from the Job**
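For reference, Iceberg's `day()` partition transform produces days since the Unix epoch, so the literal `19804` in the filter above denotes a single calendar day and can be decoded with standard `java.time`. A minimal sketch (the class name is mine):

```java
import java.time.LocalDate;

public class DayFilterCheck {
    public static void main(String[] args) {
        // day() partition values count days since 1970-01-01, so epoch day
        // 19804 identifies the one day's partitions being rewritten.
        LocalDate day = LocalDate.ofEpochDay(19804);
        System.out.println(day); // prints 2024-03-22
    }
}
```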
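The report mentions enabling partial progress and tweaking the max commits but doesn't show that variant. A sketch of what that invocation presumably looked like, using the option constants defined on `org.apache.iceberg.actions.RewriteDataFiles` (the `"10"` value here is illustrative, not necessarily the value actually tried):

```java
// Sketch only: same action as above, but with partial progress enabled so
// completed file groups can be committed incrementally rather than in one
// final commit.
SparkActions.get(spark)
    .rewriteDataFiles(table)
    .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")
    .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10") // illustrative value
    .filter(Expressions.equal(Expressions.day("occurred_at"), 19804))
    .execute();
```

Since the NPE occurs in the final commit stage inside Nessie's client, partial progress would at least surface whether any intermediate commits succeed before the failure.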