agrubb86 opened a new issue, #12178: URL: https://github.com/apache/iceberg/issues/12178
### Apache Iceberg version

1.7.1 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

To quickly summarize the issue: when writing to and then reading from an Iceberg table stored in Nessie using PySpark, the query results on read are incorrect (the data in the Parquet files is correct), and certain types of queries cause SIGSEGVs in the Spark executors. I'm reporting these as a single issue because I only encounter the SIGSEGVs on queries that also read the data incorrectly. For queries that do cause SIGSEGVs, the number of SIGSEGVs encountered per query ranges from 0 to 4 (the job is killed after 4 executor failures).

**Environment:**

```
- aarch64
- Java 17.0.13+11
- Spark 3.5.4
- Iceberg 1.7.1
- Nessie 0.101.2
- EKS on AWS
```

**Jars added to Spark:**

```
- avatica-core-1.25.0.jar
- aws-java-sdk-bundle-1.12.262.jar
- hadoop-aws-3.3.4.jar
- hbase-shaded-mapreduce-2.4.15.jar
- hbase-spark-1.1.0-SNAPSHOT.jar
- hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar
- htrace-core4-4.2.0-incubating.jar
- httpclient5-5.4.1.jar
- httpcore5-5.3.1.jar
- httpcore5-h2-5.3.1.jar
- iceberg-aws-bundle-1.7.1.jar
- iceberg-spark-runtime-3.5_2.12-1.7.1.jar
- log4j-slf4j-impl-2.18.0.jar
- mysql-connector-j-9.1.0.jar
- nessie-spark-extensions-3.5_2.12-0.101.2.jar
- protobuf-java-3.21.9.jar
- slf4j-api-1.7.36.jar
- spark-hadoop-cloud_2.12-3.5.4.jar
```

**(Potentially) Relevant Spark configs:**

```
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.fast.upload true
spark.hive.metastore.uris thrift://XXX:9083
spark.kryo.registrationRequired false
spark.kryo.unsafe false
spark.kryoserializer.buffer 1m
spark.kryoserializer.buffer.max 1g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
spark.shuffle.useOldFetchProtocol true
spark.sql.catalogImplementation hive
spark.sql.catalog.nessie org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.auth_type NONE
spark.sql.catalog.nessie.catalog-impl org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.ref dev
spark.sql.catalog.nessie.uri http://XXX:19120/api/v1
spark.sql.catalog.nessie.warehouse s3a://XXX/
spark.sql.execution.arrow.pyspark.enabled true
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions
spark.sql.sources.partitionOverwriteMode dynamic
spark.sql.warehouse.dir s3a://XXX/
spark.submit.deployMode cluster
```

**Source table definition:**

```
CREATE TABLE source_table (
  id1 STRING,
  id2 STRING,
  pt_col1 STRING,
  cnt INT,
  dt STRING)
USING iceberg
PARTITIONED BY (dt, pt_col1)
TBLPROPERTIES (
  'format' = 'iceberg/parquet',
  'format-version' = '2',
  'gc.enabled' = 'false',
  'write.metadata.delete-after-commit.enabled' = 'false',
  'write.parquet.compression-codec' = 'zstd')
```

**Example source data:**

```
id1                | id2         | pt_col1 | cnt | dt
1a3bcbf1-f334-4408 | example.com | val1    | 1   | 2025-01-25
```

**Sink table definition:**

```
CREATE TABLE sink_table (
  id1 STRING,
  pt_col1 STRING,
  distinct_id2 INT,
  dt STRING)
USING iceberg
PARTITIONED BY (dt, pt_col1)
TBLPROPERTIES (
  'format' = 'iceberg/parquet',
  'format-version' = '2',
  'gc.enabled' = 'false',
  'write.metadata.delete-after-commit.enabled' = 'false',
  'write.parquet.compression-codec' = 'zstd')
```

**Insert query:**

```
INSERT OVERWRITE TABLE sink_table PARTITION (dt = '2025-01-26')
SELECT id1, pt_col1, COUNT(DISTINCT id2)
FROM source_table
WHERE dt >= '2024-10-29' AND dt <= '2025-01-26'
GROUP BY 1, 2
```
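For reference, the DDL and queries above are run from a PySpark job. Below is a minimal sketch of an equivalent driver, reconstructed from the configs listed above rather than copied from the actual job; it assumes the SQL is submitted via plain `spark.sql()` calls, and the table identifiers are written exactly as in this report (in practice they may need catalog/namespace qualification, e.g. `nessie.<namespace>.sink_table`):

```
from pyspark.sql import SparkSession

# Catalog settings mirror the spark-defaults above; XXX placeholders as in the report.
# The remaining configs (S3A, Kryo, shuffle, etc.) are omitted here for brevity.
spark = (
    SparkSession.builder
    .appName("iceberg-nessie-read-repro")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://XXX:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "dev")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://XXX/")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    .getOrCreate()
)

# Step 1: aggregate the source table into a single dt partition of the sink table.
spark.sql("""
    INSERT OVERWRITE TABLE sink_table PARTITION (dt = '2025-01-26')
    SELECT id1, pt_col1, COUNT(DISTINCT id2)
    FROM source_table
    WHERE dt >= '2024-10-29' AND dt <= '2025-01-26'
    GROUP BY 1, 2
""")

# Step 2: reading the partition back is where the incorrect rows (and, for some
# query shapes, the executor SIGSEGVs) show up.
spark.sql("""
    SELECT * FROM sink_table
    WHERE dt = '2025-01-26' AND pt_col1 = 'val2'
""").show()
```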
_At this point, the data in the Parquet files is correct according to_ `hadoop jar parquet-tools-1.11.0.jar dump`_, i.e. all id1's have_ `distinct_id2 >= 1`

**Data stats:**

```
dt         | pt_col1 | count(1)
2025-01-25 | val2    | 3402
2025-01-25 | val1    | 1022
2025-01-24 | val1    | 519
2025-01-24 | val2    | 1777
2025-01-26 | val2    | 5158
2025-01-26 | val1    | 1524
```

_Note that_ `dt='2025-01-26', pt_col1='val2'` _is the only partition combination where data incorrectness and SIGSEGVs are encountered, and_ `num_rows_incorrect = num_total_rows - 5000` _with_ `SELECT *`_. The list of id1's that are incorrect is the same each time the insert query is run._

**The following queries return correct results without SIGSEGV:**

```
SELECT [* / COUNT(*)] FROM sink_table WHERE distinct_id2 = 0
```
_Returns:_ [0 rows / 0]

```
SELECT COUNT(*) FROM sink_table WHERE dt = '2025-01-26' AND pt_col1 = 'val2'
```
_Returns:_ 5158

**The following query returns correct results with SIGSEGV:**

```
SELECT id1, count(*) FROM sink_table WHERE distinct_id2 = 2 GROUP BY 1
```
_Returns:_ 1 row

**The following queries return incorrect results without SIGSEGV:**

```
SELECT * FROM sink_table WHERE dt = '2025-01-26' AND pt_col1 = 'val2'
```
_Returns:_ 158 rows with `distinct_id2 = 0`
_Expected:_ 0 rows with `distinct_id2 = 0`

```
INSERT OVERWRITE TABLE sink_table_clone SELECT * FROM sink_table WHERE dt = '2025-01-26'
```
_Writes a Parquet file with_ `distinct_id2 = 0` _as the value for 158 rows_

**The following queries return incorrect results with SIGSEGV:**

```
SELECT * FROM sink_table WHERE dt = '2025-01-26' AND pt_col1 = 'val2' ORDER BY distinct_id2 DESC
```
_Returns:_ 158 rows with `distinct_id2 = 0`
_Expected:_ 0 rows with `distinct_id2 = 0`

```
SELECT id1, count(*) FROM sink_table WHERE distinct_id2 = 1 GROUP BY 1
```
_Returns:_ 4999 rows
_Expected:_ 5157 rows
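As an additional cross-check of the claim that the Parquet files themselves are correct (the verification I actually ran was the parquet-tools dump mentioned above), the data files for the affected partition can also be read directly with Spark's plain Parquet reader, bypassing the Iceberg read path. A sketch of that check, assuming the same session and table naming as in the driver sketch earlier (the `files` metadata table may need catalog/namespace qualification):

```
# Sketch (not from the original report): list the data files for the affected
# partition via the Iceberg "files" metadata table, then read them directly
# with the plain Spark Parquet reader and look for distinct_id2 = 0.
paths = [
    row.file_path
    for row in spark.sql("""
        SELECT file_path
        FROM sink_table.files
        WHERE partition.dt = '2025-01-26' AND partition.pt_col1 = 'val2'
    """).collect()
]

direct = spark.read.parquet(*paths)

# Reading the files directly returns 0 such rows (matching parquet-tools),
# while reading through the Iceberg table returns 158.
print(direct.filter("distinct_id2 = 0").count())
```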
***Segfaults:***

```
25/02/04 10:50:03 INFO Executor: Running task 0.0 in stage 249.0 (TID 197)
25/02/04 10:50:03 INFO MapOutputTrackerWorker: Updating epoch to 89 and clearing cache
25/02/04 10:50:03 INFO TorrentBroadcast: Started reading broadcast variable 268 with 1 pieces (estimated total size 4.0 MiB)
25/02/04 10:50:03 INFO MemoryStore: Block broadcast_268_piece0 stored as bytes in memory (estimated size 21.3 KiB, free 5.9 GiB)
25/02/04 10:50:03 INFO TorrentBroadcast: Reading broadcast variable 268 took 4 ms
25/02/04 10:50:03 INFO MemoryStore: Block broadcast_268 stored as values in memory (estimated size 48.2 KiB, free 5.9 GiB)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000ffff99f54ae0, pid=15, tid=282
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.13+11 (17.0.13+11) (build 17.0.13+11)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.13+11 (17.0.13+11, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0xd44ae0]  ConcurrentHashTable<SymbolTableConfig, (MEMFLAGS)10>::Node* ConcurrentHashTable<SymbolTableConfig, (MEMFLAGS)10>::get_node<SymbolTableLookup>(ConcurrentHashTable<SymbolTableConfig, (MEMFLAGS)10>::Bucket const*, SymbolTableLookup&, bool*, unsigned long*) const [clone .isra.0]+0x50
```

_This segfault is seen the overwhelming majority of the time. The other three below were seen either once or twice each. The attached core dump is for this segfault; I wasn't able to retrieve the core dumps for the others because the pod went away too quickly._

[core_dump.txt](https://github.com/user-attachments/files/18669734/core_dump.txt)

```
25/02/04 09:58:36 INFO Executor: Running task 0.0 in stage 201.0 (TID 159)
25/02/04 09:58:36 INFO TorrentBroadcast: Started reading broadcast variable 211 with 1 pieces (estimated total size 4.0 MiB)
25/02/04 09:58:36 INFO MemoryStore: Block broadcast_211_piece0 stored as bytes in memory (estimated size 12.2 KiB, free 5.9 GiB)
25/02/04 09:58:36 INFO TorrentBroadcast: Reading broadcast variable 211 took 4 ms
25/02/04 09:58:36 INFO MemoryStore: Block broadcast_211 stored as values in memory (estimated size 29.0 KiB, free 5.9 GiB)
25/02/04 09:58:36 INFO TransportClientFactory: Found inactive connection to spark-2f557594cfd0870d-driver-svc.spark-dev.svc/10.3.26.54:7078, creating a new one.
25/02/04 09:58:36 INFO TransportClientFactory: Successfully created connection to spark-2f557594cfd0870d-driver-svc.spark-dev.svc/10.3.26.54:7078 after 0 ms (0 ms spent in bootstraps)
25/02/04 09:58:36 INFO CodeGenerator: Code generated in 46.470428 ms
25/02/04 09:58:36 INFO SerializableTableWithSize: Releasing resources
25/02/04 09:58:36 INFO SerializableTableWithSize: Releasing resources
25/02/04 09:58:36 INFO CodeGenerator: Code generated in 15.156982 ms
25/02/04 09:58:36 INFO TorrentBroadcast: Started reading broadcast variable 210 with 1 pieces (estimated total size 4.0 MiB)
25/02/04 09:58:36 INFO MemoryStore: Block broadcast_210_piece0 stored as bytes in memory (estimated size 30.8 KiB, free 5.8 GiB)
25/02/04 09:58:36 INFO TorrentBroadcast: Reading broadcast variable 210 took 4 ms
25/02/04 09:58:36 INFO MemoryStore: Block broadcast_210 stored as values in memory (estimated size 32.0 KiB, free 5.8 GiB)
25/02/04 09:58:36 INFO S3AInputStream: Switching to Random IO seek policy
25/02/04 09:58:36 INFO CodecPool: Got brand-new decompressor [.zstd]
25/02/04 09:58:36 INFO S3AInputStream: Switching to Random IO seek policy
25/02/04 09:58:36 INFO CodecPool: Got brand-new decompressor [.zstd]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000ffff9093553c, pid=17, tid=162
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.13+11 (17.0.13+11) (build 17.0.13+11)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.13+11 (17.0.13+11, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# j  org.apache.iceberg.shaded.io.netty.util.internal.InternalThreadLocalMap.slowGet()Lorg/apache/iceberg/shaded/io/netty/util/internal/InternalThreadLocalMap;+0
```

```
25/02/04 12:39:14 INFO Executor: Running task 0.0 in stage 286.3 (TID 233)
25/02/04 12:39:14 INFO MapOutputTrackerWorker: Updating epoch to 106 and clearing cache
25/02/04 12:39:14 INFO TorrentBroadcast: Started reading broadcast variable 333 with 1 pieces (estimated total size 4.0 MiB)
25/02/04 12:39:14 INFO MemoryStore: Block broadcast_333_piece0 stored as bytes in memory (estimated size 18.0 KiB, free 5.9 GiB)
25/02/04 12:39:14 INFO TorrentBroadcast: Reading broadcast variable 333 took 13 ms
25/02/04 12:39:14 INFO MemoryStore: Block broadcast_333 stored as values in memory (estimated size 39.2 KiB, free 5.9 GiB)
25/02/04 12:39:14 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 79, fetching them
25/02/04 12:39:14 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://mapoutputtrac...@spark-2f557594cfd0870d-driver-svc.spark-dev.svc:7078)
25/02/04 12:39:14 INFO MapOutputTrackerWorker: Got the map output locations
25/02/04 12:39:14 INFO ShuffleBlockFetcherIterator: Getting 200 (218.1 KiB) non-empty blocks including 200 (218.1 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
25/02/04 12:39:14 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 13 ms
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000ffff7f7fbee0, pid=15, tid=59
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.13+11 (17.0.13+11) (build 17.0.13+11)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.13+11 (17.0.13+11, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x7cbee0]  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<282726ul, G1BarrierSet>, (AccessInternal::BarrierType)3, 282726ul>::oop_access_barrier(oopDesc*, long)+0x0
```

```
25/02/04 12:30:02 INFO Executor: Running task 0.0 in stage 286.0 (TID 224)
25/02/04 12:30:02 INFO MapOutputTrackerWorker: Updating epoch to 100 and clearing cache
25/02/04 12:30:02 INFO TorrentBroadcast: Started reading broadcast variable 327 with 1 pieces (estimated total size 4.0 MiB)
25/02/04 12:30:02 INFO MemoryStore: Block broadcast_327_piece0 stored as bytes in memory (estimated size 18.0 KiB, free 5.9 GiB)
25/02/04 12:30:02 INFO TorrentBroadcast: Reading broadcast variable 327 took 6 ms
25/02/04 12:30:02 INFO MemoryStore: Block broadcast_327 stored as values in memory (estimated size 39.2 KiB, free 5.9 GiB)
25/02/04 12:30:02 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 79, fetching them
25/02/04 12:30:02 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://mapoutputtrac...@spark-2f557594cfd0870d-driver-svc.spark-dev.svc:7078)
25/02/04 12:30:02 INFO MapOutputTrackerWorker: Got the map output locations
25/02/04 12:30:02 INFO ShuffleBlockFetcherIterator: Getting 200 (218.1 KiB) non-empty blocks including 200 (218.1 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
25/02/04 12:30:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
25/02/04 12:30:02 INFO CodeGenerator: Code generated in 16.769295 ms
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000ffff9421465c, pid=15, tid=39
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.13+11 (17.0.13+11) (build 17.0.13+11)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.13+11 (17.0.13+11, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x6b465c]  void WeakProcessor::Task::work<G1STWIsAliveClosure, G1KeepAliveClosure>(unsigned int, G1STWIsAliveClosure*, G1KeepAliveClosure*)+0x1fc
```
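If it would help, I can try to capture the full `hs_err_pid` fatal error logs on the next occurrence by writing them to a mounted path so they survive the pod being torn down. A sketch of the kind of settings that should do this (not something this job currently sets; the volume name `crash-logs` and the host path are placeholders, not taken from the actual deployment):

```
spark.executor.extraJavaOptions -XX:ErrorFile=/var/log/spark-crash/hs_err_pid%p.log
spark.kubernetes.executor.volumes.hostPath.crash-logs.mount.path /var/log/spark-crash
spark.kubernetes.executor.volumes.hostPath.crash-logs.options.path /var/log/spark-crash
```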
Please let me know if I can be of further assistance.

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [x] I cannot contribute a fix for this bug at this time