pseudomuto opened a new issue, #10110:
URL: https://github.com/apache/iceberg/issues/10110

   ### Apache Iceberg version
   
   1.4.3
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I'm having trouble running the RewriteDataFiles action in Spark. I have a 
table with ~60B records partitioned by domain and day. When I run the job, all 
the stages complete successfully and appear to have rewritten the data, but the 
final stage fails with an NPE inside Nessie's JavaHttpClient, so no commit is 
made to the table.
   
   I have tried enabling partial progress and tweaking the max commits, but I 
haven't been able to commit successfully with any of the combinations I've 
tried.
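   For concreteness, the partial-progress variants I tried looked roughly like 
the sketch below (the option values shown are placeholders, not my exact 
settings; the constants are the standard `RewriteDataFiles` option keys):

   ```java
   import org.apache.iceberg.actions.RewriteDataFiles;
   import org.apache.iceberg.expressions.Expressions;
   import org.apache.iceberg.spark.actions.SparkActions;

   // Sketch only: partial progress commits each rewritten file group as it
   // completes instead of in one commit at the end. The "10" is a placeholder;
   // I tried several max-commit values.
   SparkActions.get(spark)
       .rewriteDataFiles(table)
       .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")
       .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10")
       .filter(Expressions.equal(Expressions.day("occurred_at"), 19804))
       .execute();
   ```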
   
   Oddly enough, the import jobs, which also use Spark and Nessie (same 
properties), are working without issue. I'm wondering whether this is a bug, or 
if anyone can shed some light on what might be going on here.
   
   **Code**
   
   ```java
   // "spark" is an existing context with the properties (below) configured
   SparkActions.get(spark)
       .rewriteDataFiles(table)
       .option(RewriteDataFilesSparkAction.MAX_CONCURRENT_FILE_GROUP_REWRITES, 
"1000")
       .option(RewriteDataFilesSparkAction.TARGET_FILE_SIZE_BYTES, "536870912")
       .filter(Expressions.equal(Expressions.day("occurred_at"), 19804))
       .execute();
   ```
   
   **Spark Properties**
   
   ```
   spark.sql.catalog.nessie.cache-enabled=false
   spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
   spark.sql.catalog.nessie.client-api-version=2
   spark.sql.catalog.nessie.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO
   spark.sql.catalog.nessie.ref=main
   spark.sql.catalog.nessie.uri=https://<domain>/api/v2
   spark.sql.catalog.nessie.warehouse=gs://<bucket>/<dir>
   spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
    spark.sql.defaultCatalog=nessie
    spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions
    spark.sql.catalog.nessie.authentication.type=BEARER
    spark.sql.catalog.nessie.quarkus.oidc.auth-server-url=https://accounts.google.com
   spark.sql.catalog.nessie.quarkus.oidc.client-id=<google_client_id>
   spark.hadoop.parquet.enable.summary-metadata=false
   spark.sql.parquet.mergeSchema=false
   spark.sql.parquet.filterPushdown=true
   spark.sql.source.partitionOverviewMode=dynamic
   spark.sql.hive.metastorePartitionPruning=true
   spark.sql.files.maxPartitionBytes=1073741824
    spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
    spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
   spark.hadoop.fs.gs.auth.service.account.enable=true
   ```
   
   **Logs from the Job**
   
   
![spark-logs](https://github.com/apache/iceberg/assets/4748863/6d4aecfc-cd54-4b4f-82f1-b90ae5034cf8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

