hussein-awala commented on PR #9902: URL: https://github.com/apache/iceberg/pull/9902#issuecomment-1987003661
@amogh-jahagirdar I tried to use Spark to write the data (see https://github.com/apache/iceberg/pull/9902/commits/25815479628015143551c2379be5608e2dd09bd7), but I ran into a serialization issue with the metastore used here (Hive thrift):
```
Serialization stack:
	- object not serializable (class: org.apache.hadoop.fs.Path, value: file:/var/folders/1q/drsr0xqn0mzf03hhhf1z67_40000gn/T/hive12824930955088255332/test/.hive-staging_hive_2024-03-09_23-55-13_177_3570944359873831448-1/-ext-10000)
	- field (class: org.apache.hadoop.hive.ql.plan.FileSinkDesc, name: dirName, type: class org.apache.hadoop.fs.Path)
	- object (class org.apache.hadoop.hive.ql.plan.FileSinkDesc, org.apache.hadoop.hive.ql.plan.FileSinkDesc@826a408)
	- field (class: org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1, name: fileSinkConfSer$1, type: class org.apache.hadoop.hive.ql.plan.FileSinkDesc)
	- object (class org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1, org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1@2d533ec6)
	- field (class: org.apache.spark.sql.execution.datasources.WriteJobDescription, name: outputWriterFactory, type: class org.apache.spark.sql.execution.datasources.OutputWriterFactory)
	- object (class org.apache.spark.sql.execution.datasources.WriteJobDescription, org.apache.spark.sql.execution.datasources.WriteJobDescription@260edd7e)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 4)
	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.datasources.WriteFilesExec, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/datasources/WriteFilesExec.$anonfun$doExecuteWrite$1:(Lorg/apache/spark/sql/execution/datasources/WriteJobDescription;Ljava/lang/String;Lorg/apache/spark/internal/io/FileCommitProtocol;Lscala/Option;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=4])
	- writeReplace data (class: java.lang.invoke.SerializedLambda)
	- object (class org.apache.spark.sql.execution.datasources.WriteFilesExec$$Lambda$2347/0x000000e001f3c5f8, org.apache.spark.sql.execution.datasources.WriteFilesExec$$Lambda$2347/0x000000e001f3c5f8@34bb023a)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 1)
	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/RDD.$anonfun$mapPartitionsInternal$2$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
	- writeReplace data (class: java.lang.invoke.SerializedLambda)
	- object (class org.apache.spark.rdd.RDD$$Lambda$2349/0x000000e001f3cca8, org.apache.spark.rdd.RDD$$Lambda$2349/0x000000e001f3cca8@68ee800f)
	- field (class: org.apache.spark.rdd.MapPartitionsRDD, name: f, type: interface scala.Function3)
	- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[3] at create at TestSparkReaderWithBloomFilter.java:210)
	- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
	- object (class scala.Tuple2, (MapPartitionsRDD[3] at create at TestSparkReaderWithBloomFilter.java:210,org.apache.spark.sql.execution.datasources.FileFormatWriter$$$Lambda$2354/0x000000e001f44d28@6ff879e5))
```
Do you have any idea? I also tried removing the metastore and using `IcebergSparkSessionExtensions` as a SQL extension, but the class is not available in this module. I can look into making it available if you prefer that approach, since it is the one described in the Iceberg documentation for Spark users.
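For context, the extension-based setup I have in mind would look roughly like the sketch below. The catalog name `local`, the Hadoop catalog type, and the warehouse path are placeholders, not something from this PR; the only point is that the session is configured with `IcebergSparkSessionExtensions` and an Iceberg catalog instead of the Hive thrift metastore, assuming the extensions jar ends up on the test classpath. Writing through the Iceberg catalog should also avoid the Hive write path (`HiveFileFormat`/`FileSinkDesc`) that put the non-serializable `Path` into the task closure above.
```java
import org.apache.spark.sql.SparkSession;

public class IcebergSessionSketch {
  public static void main(String[] args) {
    // Sketch only: configure a local session with the Iceberg SQL extensions and a
    // Hadoop catalog named "local" pointing at a throwaway warehouse directory.
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate();

    // With the extension loaded, Iceberg DDL/DML is handled by the Iceberg catalog
    // rather than going through Spark's Hive writer.
    spark.sql("CREATE TABLE local.db.sample (id BIGINT, data STRING) USING iceberg");
    spark.sql("INSERT INTO local.db.sample VALUES (1L, 'a'), (2L, 'b')");
    spark.stop();
  }
}
```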