inkerinmaa opened a new issue, #14817:
URL: https://github.com/apache/iceberg/issues/14817
### Apache Iceberg version
1.10.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
Hello.
My setup: Spark -> Iceberg REST catalog (in Docker) -> S3.
When I use MinIO as the S3 backend, everything works fine. After switching to a cloud (S3-compatible) object store, I cannot get it to work. The cloud S3 requires checksum validation, and the AWS CLI works against it with these parameters:
```
request_checksum_calculation = WHEN_REQUIRED
response_checksum_validation = WHEN_REQUIRED
```
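For reference, these are AWS profile settings; roughly how they look in `~/.aws/config` (the default profile here is just an example):

```ini
# ~/.aws/config excerpt; values mirror what the AWS CLI uses successfully against this cloud S3.
[default]
request_checksum_calculation = WHEN_REQUIRED
response_checksum_validation = WHEN_REQUIRED
```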
For the Iceberg REST service I added an environment variable to the apache/iceberg-rest-fixture Docker container:
```yaml
environment:
  - AWS_REQUEST_CHECKSUM_CALCULATION=when_required
```
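For context, a minimal docker-compose sketch of that service (service name and ports are illustrative; only the request-side variable is the one I actually verified, the response-side one is an assumption):

```yaml
# Illustrative docker-compose excerpt; only AWS_REQUEST_CHECKSUM_CALCULATION is verified in my setup.
services:
  iceberg-rest:
    image: apache/iceberg-rest-fixture
    ports:
      - "8181:8181"
    environment:
      - AWS_REQUEST_CHECKSUM_CALCULATION=when_required
      # Assumed response-side counterpart, mirroring the AWS CLI setting above.
      - AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
```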
With that in place, from Spark I can now create tables and namespaces, i.e. the operations that go through the Iceberg REST catalog. But I cannot insert new data, even though I added the following in Spark:
`spark.sql.catalog.rest.s3.checksum-enabled: "true"`
I am getting an error:
```
SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (10.42.0.130 executor 1): java.io.UncheckedIOException: Failed to close current writer
```
My Spark config:
```yaml
apiVersion: spark.apache.org/v1
kind: SparkApplication
metadata:
  name: spark-connect-server
spec:
  mainClass: "org.apache.spark.sql.connect.service.SparkConnectServer"
  sparkConf:
    spark.jars.packages: org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.iceberg:iceberg-aws-bundle:1.10.0
    # ===== IVY CONFIGURATION =====
    spark.jars.ivy: /tmp/.ivy2
    spark.jars.repositories: https://repo1.maven.org/maven2/
    # ===== ICEBERG CONFIG =====
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.defaultCatalog: rest
    spark.sql.catalog.rest: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.rest.type: rest
    spark.sql.catalog.rest.uri: http://testsrv01:8181
    spark.sql.catalog.rest.warehouse: s3://warehouse/
    spark.sql.catalog.rest.io-impl: org.apache.iceberg.aws.s3.S3FileIO
    spark.sql.catalog.rest.s3.region: "us-east-1"
    spark.sql.catalog.rest.s3.endpoint: https://s3.yyyyy.cloud:443
    spark.sql.catalog.rest.s3.access-key-id: test:[email protected]
    spark.sql.catalog.rest.s3.secret-access-key: test
    spark.sql.catalog.rest.s3.path-style-access: "true"
    spark.sql.catalogImplementation: in-memory
    spark.dynamicAllocation.enabled: "true"
    spark.dynamicAllocation.shuffleTracking.enabled: "true"
    spark.dynamicAllocation.minExecutors: "1"
    spark.dynamicAllocation.maxExecutors: "1"
    spark.kubernetes.authenticate.driver.serviceAccountName: "spark"
    spark.kubernetes.container.image: "apache/spark:4.0.1"
    spark.kubernetes.driver.pod.excludedFeatureSteps: "org.apache.spark.deploy.k8s.features.KerberosConfDriverFeatureStep"
    spark.sql.catalog.rest.s3.checksum-enabled: "true"
    spark.driver.extraJavaOptions: "-Daws.region=us-east-1"
    spark.executor.extraJavaOptions: "-Daws.region=us-east-1"
  applicationTolerations:
    resourceRetainPolicy: OnFailure
  runtimeVersions:
    scalaVersion: "2.13"
    sparkVersion: "4.0.1"
```
So the error appeared only after switching to the cloud S3 with its checksum requirements, which leads me to conclude that the checksum handling is the root cause. The available setting `spark.sql.catalog.rest.s3.checksum-enabled: "true"` has no effect on it. Is there another way to make this work?
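In case it is relevant: since the `AWS_REQUEST_CHECKSUM_CALCULATION` environment variable fixed the REST fixture, I wonder whether the same variables simply need to reach the Spark driver and executor pods, where S3FileIO runs. A minimal, untested sketch of what I mean; it assumes the AWS SDK v2 bundled in iceberg-aws-bundle honors these variables, with `AWS_RESPONSE_CHECKSUM_VALIDATION` as the assumed response-side counterpart:

```yaml
# Untested sketch: forward the SDK checksum settings to driver and executor pods.
# Assumes the AWS SDK v2 inside iceberg-aws-bundle reads these environment variables.
sparkConf:
  spark.kubernetes.driverEnv.AWS_REQUEST_CHECKSUM_CALCULATION: "when_required"
  spark.kubernetes.driverEnv.AWS_RESPONSE_CHECKSUM_VALIDATION: "when_required"
  spark.executorEnv.AWS_REQUEST_CHECKSUM_CALCULATION: "when_required"
  spark.executorEnv.AWS_RESPONSE_CHECKSUM_VALIDATION: "when_required"
```

If that is not the intended approach, guidance on the proper S3FileIO setting would be appreciated.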
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time