Re: [PR] Spark: Fix Z-order UDF to correctly handle DateType [iceberg]

via GitHub Thu, 18 Sep 2025 15:29:06 -0700


Ferdinanddb commented on code in PR #14108:
URL: https://github.com/apache/iceberg/pull/14108#discussion_r2361311963



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -2645,4 +2645,50 @@ public boolean matches(RewriteFileGroup argument) {
       return groupIDs.contains(argument.info().globalIndex());
     }
   }
+
+  @TestTemplate
+  public void testZOrderWithDateColumn() {
+    spark.conf().set("spark.sql.ansi.enabled", "false");

Review Comment:
   @ronkapoor86 OK, that is weird. I cloned the repo, made the same change as
   your PR in `SparkZOrderUDF.java`, built the JAR, then executed the following
   code:
   ```python
   from pyspark.sql import SparkSession
   
   catalog_name = "biglakeCatalog"
   
   spark: SparkSession = (
       SparkSession.builder.appName("Richfox Data Loader")
       .master("local[12]")
       .config("spark.driver.memory", "18g")
       .config("spark.jars.ivy", "/tmp/.ivy_spark")
       .config(
           "spark.jars",
           "https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.7/postgresql-42.7.7.jar,"
           # "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-4.0_2.13/1.10.0/iceberg-spark-runtime-4.0_2.13-1.10.0.jar,"
           "/home/mypath/work/perso/iceberg/spark/v4.0/spark-runtime/build/libs/iceberg-spark-runtime-4.0_2.13-1d558a9.dirty.jar,"
           "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-gcp-bundle/1.10.0/iceberg-gcp-bundle-1.10.0.jar,"
           "https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/3.1.7/gcs-connector-3.1.7-shaded.jar,"
           "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.3.6/hadoop-common-3.3.6.jar",
       )
       .config("spark.sql.execution.arrow.pyspark.enabled", "true")
       .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog")
       .config(f"spark.sql.catalog.{catalog_name}.type", "rest")
       .config(f"spark.sql.catalog.{catalog_name}.uri", "https://biglake.googleapis.com/iceberg/v1beta/restcatalog")
       .config(f"spark.sql.catalog.{catalog_name}.warehouse", "gs://some bucket")
       .config(f"spark.sql.catalog.{catalog_name}.header.x-goog-user-project", "some project")
       .config(f"spark.sql.catalog.{catalog_name}.rest.auth.type", "org.apache.iceberg.gcp.auth.GoogleAuthManager")
       .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO")
       .config(f"spark.sql.catalog.{catalog_name}.rest-metrics-reporting-enabled", "false")
       .config("spark.hadoop.fs.gs.project.id", "some project")
       .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
       .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
       .config("spark.hadoop.fs.gs.auth.type", "APPLICATION_DEFAULT")
       .config(
           "spark.sql.extensions",
           "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
       )
       .getOrCreate()
   )
   
   spark.sql("""--sql
   CALL biglakeCatalog.system.rewrite_data_files(
       table => 'biglakeCatalog.silver.cumu_adj_factors_daily',
       strategy => 'sort', sort_order => 'zorder(ticker,sec_id,trade_date)',
       options => map('rewrite-all', 'true', 'target-file-size-bytes', '536870912', 'max-concurrent-file-group-rewrites', '5')
   );
   """).show()
   
   
   +--------------------------+----------------------+---------------------+-----------------------+--------------------------+
   |rewritten_data_files_count|added_data_files_count|rewritten_bytes_count|failed_data_files_count|removed_delete_files_count|
   +--------------------------+----------------------+---------------------+-----------------------+--------------------------+
   |                         1|                     1|                 1998|                      0|                         0|
   +--------------------------+----------------------+---------------------+-----------------------+--------------------------+
   ```
   
   where:
   - `ticker` is a STRING column
   - `sec_id` is an INT column
   - `trade_date` is a DATE column
   
   And it works fine as you can see. Or am I missing something?
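
For readers unfamiliar with why a DATE column needs special handling in a Z-order key, here is a minimal sketch of the general idea in plain Python. It is not Iceberg's implementation: the function names (`date_to_ordered_bytes`, `int_to_ordered_bytes`, `string_to_ordered_bytes`, `interleave`, `zorder_key`) and the 4-byte width per column are assumptions made for illustration. It assumes a date is represented as a signed day count since 1970-01-01, which then has to be turned into bytes whose unsigned order matches date order before the bits of all columns are interleaved.

```python
from datetime import date
from typing import Sequence

EPOCH = date(1970, 1, 1)


def int_to_ordered_bytes(value: int) -> bytes:
    """Encode a signed 32-bit int so unsigned byte order matches numeric order."""
    # Flipping the sign bit makes negative values sort before positive ones.
    return ((value & 0xFFFFFFFF) ^ 0x80000000).to_bytes(4, "big")


def date_to_ordered_bytes(d: date) -> bytes:
    """Encode a date via its signed day count since the epoch (assumed representation)."""
    return int_to_ordered_bytes((d - EPOCH).days)


def string_to_ordered_bytes(s: str, width: int = 4) -> bytes:
    """Truncate or zero-pad the UTF-8 bytes to a fixed width (hypothetical choice)."""
    return s.encode("utf-8")[:width].ljust(width, b"\x00")


def interleave(columns: Sequence[bytes]) -> bytes:
    """Round-robin interleave the bits of equal-length byte strings, MSB first."""
    width = len(columns[0])
    out = 0
    for bit in range(width * 8):
        for col in columns:
            out = (out << 1) | ((col[bit // 8] >> (7 - bit % 8)) & 1)
    return out.to_bytes(width * len(columns), "big")


def zorder_key(ticker: str, sec_id: int, trade_date: date) -> bytes:
    """Build an illustrative Z-order key for the three columns in the example above."""
    return interleave([
        string_to_ordered_bytes(ticker),
        int_to_ordered_bytes(sec_id),
        date_to_ordered_bytes(trade_date),
    ])


if __name__ == "__main__":
    print(zorder_key("AAPL", 42, date(2025, 9, 18)).hex())
```

If the date were handed to the encoder in some other form (for example as a raw object rather than a day count), the byte order would no longer track date order, which is the kind of mismatch a DateType-specific code path has to avoid.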



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

