squalud opened a new issue, #46220:
URL: https://github.com/apache/arrow/issues/46220
### Describe the usage question you have. Please include as many useful details as possible.

I use Alluxio's proxy to provide S3 interface access. By setting `spark.hadoop.fs.s3a.endpoint` to `http://<alluxio-proxy-service-name>:39999/api/v1/s3/` and `spark.hadoop.fs.s3a.path.style.access` to `true` (i.e. path-style S3 access), I can use PySpark to successfully read CSV files through URLs of the form `s3a://data/tmp/file.csv`:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()

csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()
```

But when I set `spark.gluten.sql.native.arrow.reader.enabled` to `true` to read with Arrow's reader, I get an error:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.gluten.enabled", "true") \
    .config("spark.gluten.sql.native.arrow.reader.enabled", "true") \
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin") \
    .getOrCreate()

csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()
```

```
SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 27) (xx.xx.xx.xx executor 2): org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Error during calling Java code from native code: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: java.lang.RuntimeException: When getting information for key 'tmp/file.csv' in bucket 'data': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
	at org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native Method)
	at org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:53)
	at org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:34)
	at org.apache.gluten.utils.ArrowUtil$.makeArrowDiscovery(ArrowUtil.scala:128)
	at org.apache.gluten.utils.ArrowUtil$.readArrowSchema(ArrowUtil.scala:139)
	at org.apache.gluten.utils.ArrowUtil$.readArrowFileColumnNames(ArrowUtil.scala:152)
	at org.apache.gluten.datasource.ArrowCSVFileFormat.$anonfun$buildReaderWithPartitionValues$3(ArrowCSVFileFormat.scala:128)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.ArrowFileSourceScanExec$$anon$1.hasNext(ArrowFileSourceScanExec.scala:48)
	at org.apache.gluten.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
	at scala.collection.Iterator$$anon$10.hasNext(I...
```

It seems that Arrow's reader treats the first path segment, `data`, as the bucket name; in other words, the `spark.hadoop.fs.s3a.path.style.access` setting does not take effect for Arrow. How can I make Arrow's reader access S3 with path-style addressing, just like Spark's built-in reader does?

### Component(s)

C++
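For context on the Arrow side of this: Arrow's own S3 filesystem exposes the addressing style directly, independent of the Hadoop `fs.s3a.*` settings, so the question is whether Gluten forwards `fs.s3a.path.style.access` to it. A minimal PyArrow sketch (not the Gluten code path; assumes a PyArrow recent enough to have `force_virtual_addressing`, with hypothetical credential placeholders):

```
import pyarrow.csv as pacsv
import pyarrow.fs as pafs

# Point Arrow's S3 filesystem at the Alluxio proxy from this issue.
s3 = pafs.S3FileSystem(
    endpoint_override="http://<alluxio-proxy-service-name>:39999/api/v1/s3/",
    # When endpoint_override is set, Arrow uses virtual-host addressing only
    # if force_virtual_addressing=True; leaving it False keeps path-style.
    force_virtual_addressing=False,
    access_key="<access-key>",  # hypothetical placeholder
    secret_key="<secret-key>",  # hypothetical placeholder
)

# S3FileSystem paths are "bucket/key", so this reads s3://data/tmp/file.csv
# via path-style requests against the proxy endpoint.
with s3.open_input_stream("data/tmp/file.csv") as stream:
    table = pacsv.read_csv(stream)
print(table.schema)
```

If plain PyArrow can read the file this way but the Gluten-driven reader still fails, the Hadoop-to-Arrow option translation in the JNI bridge would be the place to look.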