[jira] [Updated] (HADOOP-19729) ABFS: [Perf] Network Profiling of Tailing Requests and Killing Bad Connections Proactively

Anuj Modi (Jira) Tue, 28 Oct 2025 10:34:04 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-19729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anuj Modi updated HADOOP-19729:
-------------------------------
    Description: 
It has been observed that certain requests take longer than expected to 
complete, which hinders the performance of the whole workload. Such requests 
are known as tailing requests. They can be slow for a number of reasons, the 
most prominent being a bad network connection. The ABFS driver caches network 
connections, and keeping such bad connections in the cache and reusing them 
can hurt performance.
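The reuse problem described above, and the remedy of closing a bad connection instead of caching it again, can be illustrated with a minimal keep-alive cache. This is a sketch under stated assumptions: `ConnectionCache`, `acquire`, `release` and the `bad` flag are hypothetical names for illustration, not the ABFS driver's actual classes or API. Healthy connections go back to the cache for reuse, while a connection flagged as bad is closed and dropped, so the next request opens a fresh one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/**
 * Hypothetical keep-alive connection cache (not the ABFS implementation):
 * healthy connections are returned to the cache for reuse, while a
 * connection flagged as bad is closed and dropped so the next request
 * opens a fresh one.
 */
class ConnectionCache<C extends AutoCloseable> {
    private final Deque<C> idle = new ArrayDeque<>();
    private final Supplier<C> factory;   // opens a new connection
    private int created = 0;

    ConnectionCache(Supplier<C> factory) {
        this.factory = factory;
    }

    /** Reuses the most recently released idle connection, or opens a new one. */
    synchronized C acquire() {
        C conn = idle.pollFirst();
        if (conn == null) {
            conn = factory.get();
            created++;
        }
        return conn;
    }

    /** Returns a connection; a bad one is closed instead of being cached again. */
    synchronized void release(C conn, boolean bad) {
        if (!bad) {
            idle.offerFirst(conn);
            return;
        }
        try {
            conn.close();
        } catch (Exception e) {
            // Best effort: a broken connection may not close cleanly; it is dropped anyway.
        }
    }

    synchronized int idleCount() {
        return idle.size();
    }

    synchronized int createdCount() {
        return created;
    }
}
```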

In this effort we try to identify such connections and close them so that new, 
good connections can be established and performance improved. There are two 
parts to this effort.
 # Identifying Tailing Requests: This involves profiling all network calls and 
computing latency percentiles efficiently. By default, p99 is considered the 
tail latency, and any subsequent request taking longer than the tail latency 
is considered a tailing request.

 # Proactively Killing Socket Connections: With the Apache HTTP client, we can 
now kill the socket connection of a tailing request and fail that request. Such 
failures are not thrown back to the user; instead, the request is retried 
immediately, without any sleep, on a different socket connection.
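The two parts above can be sketched as follows, under stated assumptions. A simple sliding-window, nearest-rank percentile stands in for whatever optimized estimator the driver actually uses, and the HTTP call is simulated by a function returning the latency a request would take on a given connection. `TailLatencyProfiler`, `TailRetry`, the window size, the minimum-sample threshold and `maxAttempts` are all hypothetical, not part of the ABFS code.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.function.IntToLongFunction;

/**
 * Hypothetical tail-latency profiler: keeps a sliding window of recent call
 * latencies and treats the configured percentile (e.g. 99.0 for p99) as the
 * tail latency. A call running longer than that is a tailing request.
 */
class TailLatencyProfiler {
    private final int windowSize;
    private final double percentile;
    private final int minSamples;   // no tail latency until enough data is seen
    private final Deque<Long> window = new ArrayDeque<>();

    TailLatencyProfiler(int windowSize, double percentile, int minSamples) {
        this.windowSize = windowSize;
        this.percentile = percentile;
        this.minSamples = minSamples;
    }

    /** Records the latency (ms) of a network call that completed normally. */
    synchronized void record(long latencyMs) {
        if (window.size() == windowSize) {
            window.removeFirst();   // drop the oldest sample
        }
        window.addLast(latencyMs);
    }

    /** Nearest-rank percentile of the window, or -1 before minSamples calls. */
    synchronized long tailLatencyMs() {
        if (window.size() < minSamples) {
            return -1;
        }
        long[] sorted = window.stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True if a call that has been running for elapsedMs exceeds the tail latency. */
    boolean isTailing(long elapsedMs) {
        long tail = tailLatencyMs();
        return tail >= 0 && elapsedMs > tail;
    }
}

/** Hypothetical, simulated kill-and-retry loop; latencyOnConnection stands in for a real call. */
class TailRetry {
    /**
     * If an attempt's simulated latency exceeds the tail latency, the attempt is
     * treated as killed and retried immediately (no backoff sleep) on a new
     * connection; the failure is not surfaced to the caller. The last attempt
     * runs to completion. Returns the id of the connection that served the request.
     */
    static int execute(TailLatencyProfiler profiler,
                       IntToLongFunction latencyOnConnection,
                       int firstConnectionId,
                       int maxAttempts) {
        int connectionId = firstConnectionId;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long latency = latencyOnConnection.applyAsLong(connectionId);
            boolean lastAttempt = attempt == maxAttempts;
            if (!lastAttempt && profiler.isTailing(latency)) {
                connectionId++;   // "kill" this connection and switch to a new one
                continue;         // retry immediately, without sleeping
            }
            profiler.record(latency);
            return connectionId;
        }
        throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
}
```

A real client would abort the in-flight request once the tail deadline passes rather than inspect its latency afterwards. Letting the final attempt run to completion, so a request is never failed solely for being slow, is an assumption of this sketch rather than something the issue specifies.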

> ABFS: [Perf] Network Profiling of Tailing Requests and Killing Bad 
> Connections Proactively
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-19729
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19729
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.4.2
>            Reporter: Anuj Modi
>            Assignee: Anuj Modi
>            Priority: Major
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

