[ 
https://issues.apache.org/jira/browse/SPARK-52507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-52507:
----------------------------------
    Description: 
Using the fallback storage with storage decommissioning on Kubernetes can run 
into the situation where some tasks try to read from an executor that has just 
been decommissioned. The driver already has the updated location of the 
migrated shuffle data, but the task still uses the outdated location.

Given that the fallback storage is enabled and shuffle data is always migrated 
to the fallback storage only (SPARK-52506), a fetch failure can very likely be 
recovered from the fallback storage directly. The task should not have to go 
through a full fetch failure, restarting the task or stage, just to get hold 
of the updated shuffle data location.

This benefits from

1. connections to decommissioned executors failing quickly (connection refused 
rather than connection timeout), see SPARK-52505
2. shuffle data migration targeting the fallback storage only, see SPARK-52506
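For context, a minimal sketch of the setup this issue assumes, using the standard decommissioning settings (the fallback storage path is a placeholder, not taken from the ticket):

```properties
# Enable executor decommissioning and shuffle block migration
spark.decommission.enabled                        true
spark.storage.decommission.enabled                true
spark.storage.decommission.shuffleBlocks.enabled  true
# Migrate shuffle data to a remote fallback storage (placeholder path)
spark.storage.decommission.fallbackStorage.path   s3a://my-bucket/spark-fallback/
```

With this configuration, shuffle blocks of a decommissioning executor are copied to the fallback path, which is what allows a failed fetch to be retried against the fallback storage instead of the gone executor.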

  was:
Using the fallback storage with storage decommissioning on Kubernetes can run 
into the situation where some tasks try to read from an executor that has just 
been decommissioned. The driver already has the updated location of the 
migrated shuffle data, but the task still uses the outdated location.

Given that the fallback storage is enabled and shuffle data is always migrated 
to the fallback storage only (SPARK-52506), a fetch failure can very likely be 
recovered from the fallback storage directly. The task should not have to go 
through a full fetch failure, restarting the task or stage, just to get hold 
of the updated shuffle data location.

This requires

1. connections to decommissioned executors failing quickly (connection refused 
rather than connection timeout), see SPARK-52505
2. shuffle data migration targeting the fallback storage only, see SPARK-52506


> Quick fallback to fallback storage on fetch failure
> ---------------------------------------------------
>
>                 Key: SPARK-52507
>                 URL: https://issues.apache.org/jira/browse/SPARK-52507
>             Project: Spark
>          Issue Type: Sub-task
>          Components: k8s, Kubernetes
>    Affects Versions: 4.1.0
>            Reporter: Enrico Minack
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
