[I] java.lang.UnsatisfiedLinkError when reading CSV from S3 by arrow's csv reader [arrow]

via GitHub Sun, 20 Apr 2025 19:07:21 -0700


squalud opened a new issue, #46185:
URL: https://github.com/apache/arrow/issues/46185


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I built Gluten from source with the following command, which also builds 
Arrow:
   `./dev/buildbundle-veloxbe.sh --enable_hdfs=ON --enable_s3=ON --enable_vcpkg=ON --spark_version=3.5`
   
   After the build succeeded, I ran PySpark and used Arrow's S3 CSV reader to 
read a CSV file on S3. The job failed with a `java.lang.UnsatisfiedLinkError`:
   
   ```
   SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due 
to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost 
task 0.3 in stage 0.0 (TID 3) (xx.xx.xx.xx executor 1): 
org.apache.gluten.exception.GlutenException: 
org.apache.gluten.exception.GlutenException: Error during calling Java code 
from native code: org.apache.gluten.exception.GlutenException: 
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
   Error Source: RUNTIME
   Error Code: INVALID_STATE
   Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 
0]: Error during calling Java code from native code: 
java.lang.UnsatisfiedLinkError: /tmp/jnilib-17305424060615380389.tmp: 
/tmp/jnilib-17305424060615380389.tmp: undefined symbol: 
_ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
   at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
   at 
java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:388)
   at 
java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:232)
   at 
java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:174)
   at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2394)
   at java.base/java.lang.Runtime.load0(Runtime.java:755)
   at java.base/java.lang.System.load(System.java:1970)
   at org.apache.arrow.dataset.jni.JniLoader.load(JniLoader.java:92)
   at org.apache.arrow.dataset.jni.JniLoader.loadRemaining(JniLoader.java:75)
   at org.apache.arrow.dataset.jni.JniLoader.ensureLoaded(JniLoader.java:61)
   at 
org.apache.arrow.dataset.jni.NativeMemoryPool.createListenable(NativeMemoryPool.java:44)
   at 
org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.<init>(ArrowNativeMemoryPool.java:34)
   at 
org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.createArrowNativeMemoryPool(ArrowNativeMemoryPool.java:47)
   at 
org.apache.gluten.memory.arrow.pool.ArrowNativeMemoryPool.lambda$arrowPool$0(ArrowNativeMemoryPool.java:42)
   at 
org.apache.spark.task.TaskResourceRegistry.$anonfun$addResourceIfNotRegistered$1(Task...
   ```
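
The undefined symbol in the trace is an Itanium C++ ABI mangled name. As a rough reading aid, the sketch below extracts the length-prefixed identifiers of the leading nested name in pure Python. The helper name `nested_name_parts` is made up for illustration, and it only handles the simple `_ZN...E` form (no templates or substitutions inside the name itself); a real demangler such as `c++filt` remains the authoritative tool.

```python
def nested_name_parts(mangled: str) -> list[str]:
    """Return the identifiers of the leading nested name in an Itanium-mangled symbol.

    Hypothetical helper for illustration; it does not decode parameter types.
    """
    if not mangled.startswith("_ZN"):
        raise ValueError("expected a symbol starting with _ZN")
    pos = 3
    # Skip cv-qualifiers on the member function (K = const, V = volatile, r = restrict).
    while pos < len(mangled) and mangled[pos] in "KVr":
        pos += 1
    parts = []
    # Each component is <decimal length><identifier>; stop at the first non-digit.
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    return parts


symbol = "_ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE"
print("::".join(nested_name_parts(symbol)))
```

For the symbol above this prints `Aws::S3::S3Client::CreateSession`, so the missing piece is a member function of the AWS SDK's S3 client.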
   
   The following is my build log. Every `ARROW_S3` option in the build 
output is set to `ON`:
   ```
   ......
   
   + pushd /workspace/incubator-gluten/dev/../ep/_ep/arrow_ep/cpp
   /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp 
/workspace/incubator-gluten/dev
   + cmake_install -DARROW_S3=ON -DARROW_PARQUET=ON -DARROW_FILESYSTEM=ON 
-DARROW_PROTOBUF_USE_SHARED=OFF -DARROW_DEPENDENCY_USE_SHARED=OFF 
-DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_WITH_THRIFT=ON -DARROW_WITH_LZ4=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON 
-DARROW_JEMALLOC=OFF -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE 
-DARROW_WITH_UTF8PROC=OFF -DARROW_TESTING=ON -DCMAKE_INSTALL_PREFIX=/usr/local 
-DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON
   
   ......
   
   + COMPILER_FLAGS='-mavx2 -mfma -mavx -mf16c -mlzcnt -std=c++17 -mbmi2 '
   + cmake -Wno-dev -B_build -GNinja -DCMAKE_POSITION_INDEPENDENT_CODE=ON 
-DCMAKE_CXX_STANDARD=17 '' '' '-DCMAKE_CXX_FLAGS=-mavx2 -mfma -mavx -mf16c 
-mlzcnt -std=c++17 -mbmi2 ' -DBUILD_TESTING=OFF -DARROW_S3=ON 
-DARROW_PARQUET=ON -DARROW_FILESYSTEM=ON -DARROW_PROTOBUF_USE_SHARED=OFF 
-DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED 
-DARROW_WITH_THRIFT=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON 
-DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_JEMALLOC=OFF 
-DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE 
-DARROW_WITH_UTF8PROC=OFF -DARROW_TESTING=ON -DCMAKE_INSTALL_PREFIX=/usr/local 
-DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON
   
   ......
   
   -- ---------------------------------------------------------------------
   -- Arrow version:                                 15.0.0
   --
   -- Build configuration summary:
   --   Generator: Ninja
   --   Build type: RELEASE
   --   Source directory: /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp
   --   Install prefix: /usr/local
   --
   -- Compile and link options:
   --
   --   ARROW_CXXFLAGS="" [default=""]
   --       Compiler flags to append when compiling Arrow
   --   ARROW_BUILD_STATIC=ON [default=ON]
   
   ......
   
   --   ARROW_ACERO=OFF [default=OFF]
   --       Build the Arrow Acero Engine Module
   --   ARROW_AZURE=OFF [default=OFF]
   --       Build Arrow with Azure support (requires the Azure SDK for C++)
   --   ARROW_BUILD_UTILITIES=OFF [default=OFF]
   --       Build Arrow commandline utilities
   --   ARROW_COMPUTE=OFF [default=OFF]
   --       Build all Arrow Compute kernels
   --   ARROW_CSV=OFF [default=OFF]
   --       Build the Arrow CSV Parser Module
   --   ARROW_CUDA=OFF [default=OFF]
   --       Build the Arrow CUDA extensions (requires CUDA toolkit)
   --   ARROW_DATASET=OFF [default=OFF]
   --       Build the Arrow Dataset Modules
   --   ARROW_FILESYSTEM=ON [default=OFF]
   --       Build the Arrow Filesystem Layer
   --   ARROW_FLIGHT=OFF [default=OFF]
   --       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
   --   ARROW_FLIGHT_SQL=OFF [default=OFF]
   --       Build the Arrow Flight SQL extension
   --   ARROW_GANDIVA=OFF [default=OFF]
   --       Build the Gandiva libraries
   --   ARROW_GCS=OFF [default=OFF]
   --       Build Arrow with GCS support (requires the GCloud SDK for C++)
   --   ARROW_HDFS=OFF [default=OFF]
   --       Build the Arrow HDFS bridge
   --   ARROW_IPC=ON [default=ON]
   --       Build the Arrow IPC extensions
   --   ARROW_JEMALLOC=OFF [default=ON]
   --       Build the Arrow jemalloc-based allocator
   --   ARROW_JSON=ON [default=OFF]
   --       Build Arrow with JSON support (requires RapidJSON)
   --   ARROW_MIMALLOC=OFF [default=OFF]
   --       Build the Arrow mimalloc-based allocator
   --   ARROW_PARQUET=ON [default=OFF]
   --       Build the Parquet libraries
   --   ARROW_ORC=OFF [default=OFF]
   --       Build the Arrow ORC adapter
   --   ARROW_PYTHON=OFF [default=OFF]
   --       Build some components needed by PyArrow.
   --       (This is a deprecated option. Use CMake presets instead.)
   --   ARROW_S3=ON [default=OFF]
   --       Build Arrow with S3 support (requires the AWS SDK for C++)
   --   ARROW_SKYHOOK=OFF [default=OFF]
   --       Build the Skyhook libraries
   --   ARROW_SUBSTRAIT=OFF [default=OFF]
   --       Build the Arrow Substrait Consumer Module
   --   ARROW_TENSORFLOW=OFF [default=OFF]
   --       Build Arrow with TensorFlow support enabled
   --   ARROW_TESTING=ON [default=OFF]
   --       Build the Arrow testing libraries
   
   ......
   
   -- ---------------------------------------------------------------------
   -- Arrow version:                                 15.0.0
   --
   -- Build configuration summary:
   --   Generator: Unix Makefiles
   --   Build type: RELEASE
   --   Source directory: /workspace/incubator-gluten/ep/_ep/arrow_ep/cpp
   --   Install prefix: /workspace/incubator-gluten/ep/_ep/arrow_ep/java-dist
   --
   -- Compile and link options:
   --
   --   ARROW_CXXFLAGS="" [default=""]
   --       Compiler flags to append when compiling Arrow
   --   ARROW_BUILD_STATIC=ON [default=ON]
   --       Build static libraries
   
   ......
   
   -- Project component options:
   --
   --   ARROW_ACERO=ON [default=OFF]
   --       Build the Arrow Acero Engine Module
   --   ARROW_AZURE=OFF [default=OFF]
   --       Build Arrow with Azure support (requires the Azure SDK for C++)
   --   ARROW_BUILD_UTILITIES=OFF [default=OFF]
   --       Build Arrow commandline utilities
   --   ARROW_COMPUTE=ON [default=OFF]
   --       Build all Arrow Compute kernels
   --   ARROW_CSV=ON [default=OFF]
   --       Build the Arrow CSV Parser Module
   --   ARROW_CUDA=OFF [default=OFF]
   --       Build the Arrow CUDA extensions (requires CUDA toolkit)
   --   ARROW_DATASET=ON [default=OFF]
   --       Build the Arrow Dataset Modules
   --   ARROW_FILESYSTEM=ON [default=OFF]
   --       Build the Arrow Filesystem Layer
   --   ARROW_FLIGHT=OFF [default=OFF]
   --       Build the Arrow Flight RPC System (requires GRPC, Protocol Buffers)
   --   ARROW_FLIGHT_SQL=OFF [default=OFF]
   --       Build the Arrow Flight SQL extension
   --   ARROW_GANDIVA=OFF [default=OFF]
   --       Build the Gandiva libraries
   --   ARROW_GCS=OFF [default=OFF]
   --       Build Arrow with GCS support (requires the GCloud SDK for C++)
   --   ARROW_HDFS=ON [default=OFF]
   --       Build the Arrow HDFS bridge
   --   ARROW_IPC=ON [default=ON]
   --       Build the Arrow IPC extensions
   --   ARROW_JEMALLOC=ON [default=ON]
   --       Build the Arrow jemalloc-based allocator
   --   ARROW_JSON=ON [default=OFF]
   --       Build Arrow with JSON support (requires RapidJSON)
   --   ARROW_MIMALLOC=OFF [default=OFF]
   --       Build the Arrow mimalloc-based allocator
   --   ARROW_PARQUET=ON [default=OFF]
   --       Build the Parquet libraries
   --   ARROW_ORC=OFF [default=OFF]
   --       Build the Arrow ORC adapter
   --   ARROW_PYTHON=OFF [default=OFF]
   --       Build some components needed by PyArrow.
   --       (This is a deprecated option. Use CMake presets instead.)
   --   ARROW_S3=ON [default=OFF]
   --       Build Arrow with S3 support (requires the AWS SDK for C++)
   --   ARROW_SKYHOOK=OFF [default=OFF]
   --       Build the Skyhook libraries
   --   ARROW_SUBSTRAIT=ON [default=OFF]
   --       Build the Arrow Substrait Consumer Module
   --   ARROW_TENSORFLOW=OFF [default=OFF]
   --       Build Arrow with TensorFlow support enabled
   --   ARROW_TESTING=OFF [default=OFF]
   --       Build the Arrow testing libraries
   ......
   
   ```
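
The log contains two Arrow configuration summaries (a Ninja build installed to `/usr/local` and a Unix Makefiles build installed to `java-dist`), and they differ in several component options such as `ARROW_CSV` and `ARROW_DATASET`. A minimal sketch for comparing such summaries mechanically is shown below. The function names and the shortened sample strings are illustrative only, and the regular expression assumes the `--   NAME=VALUE [default=...]` line shape seen in the log.

```python
import re

# Matches summary lines such as "--   ARROW_S3=ON [default=OFF]".
OPTION_RE = re.compile(r"^\s*--\s+(ARROW_\w+)=(\S+)\s+\[default=")


def parse_summary(log_text: str) -> dict[str, str]:
    """Collect ARROW_* option values from one CMake configuration summary."""
    options = {}
    for line in log_text.splitlines():
        match = OPTION_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def diff_options(first: dict[str, str], second: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return options present in both summaries whose values differ."""
    return {
        name: (first[name], second[name])
        for name in sorted(first.keys() & second.keys())
        if first[name] != second[name]
    }


# Shortened excerpts of the two summaries above (only three options each).
ninja_summary = """
   --   ARROW_CSV=OFF [default=OFF]
   --       Build the Arrow CSV Parser Module
   --   ARROW_DATASET=OFF [default=OFF]
   --   ARROW_S3=ON [default=OFF]
"""
java_dist_summary = """
   --   ARROW_CSV=ON [default=OFF]
   --   ARROW_DATASET=ON [default=OFF]
   --   ARROW_S3=ON [default=OFF]
"""

differences = diff_options(parse_summary(ninja_summary), parse_summary(java_dist_summary))
for name, (a, b) in differences.items():
    print(f"{name}: {a} -> {b}")
```

Feeding the full log sections to `parse_summary` instead of the excerpts would list every option that changed between the two configurations.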
   
   When I log in to the Spark executor and inspect the extracted library with `nm`, the symbol is listed as undefined (`U`):
   
   ```
   $ nm -D /tmp/jnilib-17305424060615380389.tmp | grep _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
                    U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
   
   ```
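
The same `nm -D` check can be scripted when several libraries need to be inspected. The sketch below filters `nm -D` output for undefined (`U`) entries. The function names are hypothetical, and the sample text is hand-written for illustration (the `T` and `w` lines and the address are made up, not taken from the real library).

```python
import subprocess


def undefined_symbols(nm_output: str) -> list[str]:
    """Return names that `nm -D` reports with type 'U' (undefined)."""
    symbols = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined entries have no address column: only the type letter and the name.
        if len(fields) == 2 and fields[0] == "U":
            symbols.append(fields[1])
    return symbols


def nm_dynamic(path: str) -> str:
    """Run `nm -D` on a shared library and return its stdout (requires binutils)."""
    return subprocess.run(
        ["nm", "-D", path], capture_output=True, text=True, check=True
    ).stdout


# Hand-written sample; only the 'U' line mirrors the output shown above.
sample = """\
                 U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
0000000000001000 T hypothetical_defined_symbol
                 w hypothetical_weak_symbol
"""

for name in undefined_symbols(sample):
    print(name)
```

On the executor, `undefined_symbols(nm_dynamic("/tmp/jnilib-17305424060615380389.tmp"))` would apply the same filter to the real file.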
   
   After extracting the Gluten jar, I found these libraries:
   
   ```
   $ find ./ -name *.so
   ./linux/amd64/libvelox.so
   ./linux/amd64/libgluten.so
   ./x86_64/libarrow_cdata_jni.so
   ./x86_64/libarrow_dataset_jni.so
   $ nm -D x86_64/libarrow_dataset_jni.so | grep _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
                    U _ZNK3Aws2S38S3Client13CreateSessionERKNS0_5Model20CreateSessionRequestE
   ```
   
   It looks like the aws-cpp-sdk-s3 library is not statically linked in. Or do 
I need to install the AWS SDK libraries manually in my Dockerfile?
   
   How can I work around this?
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
