[I] Does Arrow support 'path-style' access to S3? [arrow]

via GitHub Thu, 24 Apr 2025 02:41:32 -0700


squalud opened a new issue, #46220:
URL: https://github.com/apache/arrow/issues/46220


   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I use Alluxio's proxy to provide S3 interface access. 
   
   By setting `spark.hadoop.fs.ks3.endpoint` to 
`http://<alluxio-proxy-service-name>:39999/api/v1/s3/` and setting 
`spark.hadoop.fs.s3a.path.style.access` to `true` (so that S3 is accessed 
`path-style`), I can successfully read CSV files with PySpark through URLs of 
the form `s3a://data/tmp/file.csv`:
   ```
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder \
       .appName("Read S3 Data in PySpark") \
       .remote("sc://xx.xx.xx.xx:15002") \
       .config("", "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
       .config("spark.hadoop.fs.s3a.path.style.access", "true") \
       .getOrCreate()
   
   csv_path = "s3a://data/tmp/file.csv"
   df_csv = spark.read.csv(csv_path, header=True)
   df_csv.show()
   ```
   
   But when I set `spark.gluten.sql.native.arrow.reader.enabled` to `true` so 
that Arrow's reader is used, I get an error:
   
   ```
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder \
       .appName("Read S3 Data in PySpark") \
       .remote("sc://xx.xx.xx.xx:15002") \
       .config("", "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
       .config("spark.hadoop.fs.s3a.path.style.access", "true") \
       .config("spark.gluten.enabled", "true") \
       .config("spark.gluten.sql.native.arrow.reader.enabled", "true") \
       .config("spark.plugins", "org.apache.gluten.GlutenPlugin") \
       .getOrCreate()
   
   csv_path = "s3a://data/tmp/file.csv"
   df_csv = spark.read.csv(csv_path, header=True)
   df_csv.show()
   ```
   
   ```
   SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due 
to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost 
task 0.3 in stage 6.0 (TID 27) (xx.xx.xx.xx executor 2): 
org.apache.gluten.exception.GlutenException: 
org.apache.gluten.exception.GlutenException: Error during calling Java code 
from native code: org.apache.gluten.exception.GlutenException: 
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
   Error Source: RUNTIME
   Error Code: INVALID_STATE
   Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 
0]: Error during calling Java code from native code: 
java.lang.RuntimeException: When getting information for key 'tmp/file.csv' in 
bucket 'data': AWS Error ACCESS_DENIED during HeadObject operation: No response 
body.
        at 
org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native 
Method)
        at 
org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:53)
        at 
org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:34)
        at 
org.apache.gluten.utils.ArrowUtil$.makeArrowDiscovery(ArrowUtil.scala:128)
        at 
org.apache.gluten.utils.ArrowUtil$.readArrowSchema(ArrowUtil.scala:139)
        at 
org.apache.gluten.utils.ArrowUtil$.readArrowFileColumnNames(ArrowUtil.scala:152)
        at 
org.apache.gluten.datasource.ArrowCSVFileFormat.$anonfun$buildReaderWithPartitionValues$3(ArrowCSVFileFormat.scala:128)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
        at 
org.apache.spark.sql.execution.ArrowFileSourceScanExec$$anon$1.hasNext(ArrowFileSourceScanExec.scala:48)
        at 
org.apache.gluten.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
        at scala.collection.Iterator$$anon$10.hasNext(I...
   ```
   
   
   It seems that Arrow's reader treats the first-level path component `data` 
as the bucket; in other words, the `spark.hadoop.fs.s3a.path.style.access` 
setting does not take effect in Arrow. How can I make Arrow's reader access S3 
`path-style`, the way Spark's original reader does?
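
For reference, here is a minimal sketch of how I understand the two S3 addressing styles to differ for the same bucket and key. The host name `alluxio-proxy.example` and both helper functions are hypothetical and only for illustration; this is not how Spark, Gluten, or Arrow build requests internally.

```python
from urllib.parse import urlsplit, urlunsplit


def path_style_url(endpoint: str, bucket: str, key: str) -> str:
    """Path-style: the bucket is the first path segment after the endpoint."""
    return f"{endpoint.rstrip('/')}/{bucket}/{key.lstrip('/')}"


def virtual_hosted_url(endpoint: str, bucket: str, key: str) -> str:
    """Virtual-hosted-style: the bucket is prepended to the host name."""
    parts = urlsplit(endpoint)
    path = f"{parts.path.rstrip('/')}/{key.lstrip('/')}"
    return urlunsplit((parts.scheme, f"{bucket}.{parts.netloc}", path, "", ""))


# Hypothetical endpoint standing in for the Alluxio proxy address.
endpoint = "http://alluxio-proxy.example:39999/api/v1/s3/"

print(path_style_url(endpoint, "data", "tmp/file.csv"))
# http://alluxio-proxy.example:39999/api/v1/s3/data/tmp/file.csv
print(virtual_hosted_url(endpoint, "data", "tmp/file.csv"))
# http://data.alluxio-proxy.example:39999/api/v1/s3/tmp/file.csv
```

A proxy reachable only under a single service name would typically need the first form, since the second requires the bucket-prefixed host name to resolve.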
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
