squalud opened a new issue, #46220:
URL: https://github.com/apache/arrow/issues/46220
### Describe the usage question you have. Please include as many useful details as possible.
I use Alluxio's proxy to provide an S3-compatible interface.
With `spark.hadoop.fs.ks3.endpoint` set to
`http://<alluxio-proxy-service-name>:39999/api/v1/s3/` and
`spark.hadoop.fs.s3a.path.style.access` set to `true` (so that S3 is accessed
in `path-style`), PySpark reads CSV files successfully from URLs of the form
`s3a://data/tmp/file.csv`:
```
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("", "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()
csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()
```
But when I also set `spark.gluten.sql.native.arrow.reader.enabled` to `true`
so that the file is read with Arrow's reader, the same read fails:
```
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("", "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.gluten.enabled", "true") \
    .config("spark.gluten.sql.native.arrow.reader.enabled", "true") \
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin") \
    .getOrCreate()
csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()
```
```
SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due
to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost
task 0.3 in stage 6.0 (TID 27) (xx.xx.xx.xx executor 2):
org.apache.gluten.exception.GlutenException:
org.apache.gluten.exception.GlutenException: Error during calling Java code
from native code: org.apache.gluten.exception.GlutenException:
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID:
0]: Error during calling Java code from native code:
java.lang.RuntimeException: When getting information for key 'tmp/file.csv' in
bucket 'data': AWS Error ACCESS_DENIED during HeadObject operation: No response
body.
at
org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native
Method)
at
org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:53)
at
org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:34)
at
org.apache.gluten.utils.ArrowUtil$.makeArrowDiscovery(ArrowUtil.scala:128)
at
org.apache.gluten.utils.ArrowUtil$.readArrowSchema(ArrowUtil.scala:139)
at
org.apache.gluten.utils.ArrowUtil$.readArrowFileColumnNames(ArrowUtil.scala:152)
at
org.apache.gluten.datasource.ArrowCSVFileFormat.$anonfun$buildReaderWithPartitionValues$3(ArrowCSVFileFormat.scala:128)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
at
org.apache.spark.sql.execution.ArrowFileSourceScanExec$$anon$1.hasNext(ArrowFileSourceScanExec.scala:48)
at
org.apache.gluten.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
at scala.collection.Iterator$$anon$10.hasNext(I...
```
It seems that Arrow's reader treats the first path segment, `data`, as the
bucket; in other words, the `spark.hadoop.fs.s3a.path.style.access` setting
does not take effect on the Arrow side. How can I make Arrow's reader access
S3 in `path-style`, the way Spark's original reader does?
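
To make the distinction concrete, here is a minimal sketch of how the two S3 addressing styles turn the same `s3a://` URI into a request URL. It only illustrates URL construction and is not the code that Hadoop or Arrow actually runs. The helper names and the host `alluxio-proxy` are placeholders I made up for the example.

```python
from urllib.parse import urlsplit


def split_s3a_uri(uri):
    """Split an s3a:// URI into (bucket, key)."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path.lstrip("/")


def path_style_url(endpoint, bucket, key):
    """Path-style: the bucket is a path segment appended to the endpoint."""
    return f"{endpoint.rstrip('/')}/{bucket}/{key.lstrip('/')}"


def virtual_hosted_url(endpoint, bucket, key):
    """Virtual-hosted-style: the bucket is prepended to the endpoint host."""
    parts = urlsplit(endpoint)
    base_path = parts.path.rstrip("/")
    return f"{parts.scheme}://{bucket}.{parts.netloc}{base_path}/{key.lstrip('/')}"


# Hypothetical endpoint standing in for the Alluxio proxy service.
endpoint = "http://alluxio-proxy:39999/api/v1/s3/"
bucket, key = split_s3a_uri("s3a://data/tmp/file.csv")

print(path_style_url(endpoint, bucket, key))
print(virtual_hosted_url(endpoint, bucket, key))
```

In both styles the bucket is still `data` and the key is `tmp/file.csv`; what changes is whether the bucket travels in the URL path or in the host name. A proxy that serves S3 under a fixed path prefix can, as far as I understand, only be reached with the first form.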
### Component(s)
C++