[
https://issues.apache.org/jira/browse/SPARK-52507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Enrico Minack updated SPARK-52507:
----------------------------------
Description:
Using the fallback storage with storage decommissioning on Kubernetes can run
into the situation where some tasks try to read from an executor that has just
been decommissioned. The driver has the updated location information for the
migrated shuffle data, but the task still uses the outdated location.
Given we have the fallback storage enabled and shuffle data is always migrated
to the fallback storage only (SPARK-52506), it is very likely that a fetch
failure can be recovered from the fallback storage. The task does not need to
go through a fetch failure that restarts the task or stage just to obtain the
updated shuffle data location.
This benefits from
1. connections to decommissioned executors failing quickly (connection refused
rather than connection timeout), see SPARK-52505
2. storage migration only migrating to the fallback storage, see SPARK-52506
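For context, a minimal sketch of the configuration this scenario assumes
(property names as documented in the Spark configuration reference; the
fallback path below is a placeholder, not from this issue):

```
# spark-defaults.conf sketch -- assumes Spark with storage decommissioning
# support; the s3a path is a placeholder.
spark.decommission.enabled                         true
spark.storage.decommission.enabled                 true
spark.storage.decommission.shuffleBlocks.enabled   true
spark.storage.decommission.fallbackStorage.path    s3a://my-bucket/spark-fallback/
```

With these settings, shuffle blocks of a decommissioning executor are migrated,
and (per SPARK-52506) can be directed to the fallback storage only, which is
what makes a quick fallback on fetch failure viable.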
was:
Using the fallback storage with storage decommissioning on Kubernetes can run
into the situation where some tasks try to read from an executor that has just
been decommissioned. The driver has updated location information of the
migrated shuffle data, but the task uses the outdated location.
Given we have the fallback storage enabled and shuffle data is always migrated
to the fallback storage only (SPARK-52506), it is very likely that a fetch
failure can be recovered from the fallback storage. The task does not need to
go through a fetch failure to restart the task or stage to get hold of the
updated shuffle data location.
This requires
1. connections to decommissioned executors to quickly fail (connection refused
rather than connection timeout), see SPARK-52505
2. storage migration only migrates to the fallback storage, see SPARK-52506
> Quick fallback to fallback storage on fetch failure
> ---------------------------------------------------
>
> Key: SPARK-52507
> URL: https://issues.apache.org/jira/browse/SPARK-52507
> Project: Spark
> Issue Type: Sub-task
> Components: k8s, Kubernetes
> Affects Versions: 4.1.0
> Reporter: Enrico Minack
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)