[
https://issues.apache.org/jira/browse/HADOOP-19729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anuj Modi updated HADOOP-19729:
-------------------------------
Description:
Requests that take much longer than expected to complete have been observed to
hurt the performance of the whole workload. Such requests are known as
tailing requests. They can be slow for a number of reasons, the most prominent
being a bad network connection. The ABFS driver caches network connections, and
keeping such bad connections in the cache and reusing them can hurt
performance.
This effort tries to identify such connections and close them so that new,
healthy connections can be established and performance improves. There are two
parts to this effort.
# Identifying Tailing Requests: This involves profiling all network calls
and computing latency percentiles efficiently. By default, p99 is treated as
the tail latency, and any later request that takes longer than the tail latency
is considered a tailing request.
# Proactively Killing Socket Connections: With the Apache client, we can now
kill the socket connection and fail the tailing request. Such failures are not
thrown back to the user; the request is retried immediately, without any sleep,
on a different socket connection.
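
As a rough illustration of the first part, the sketch below tracks recent request latencies and flags a request as tailing once it exceeds the configured percentile. It is not the ABFS driver's implementation: the class name TailLatencyTracker, the fixed-size sliding window, the nearest-rank percentile, and the rule that nothing is flagged until the window is full are all assumptions made for this example.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of the ABFS driver. */
final class TailLatencyTracker {
    private final long[] window;     // ring buffer of recent latencies, in ms
    private final double percentile; // e.g. 99.0 for p99
    private int count;               // number of valid samples in the window
    private int next;                // index of the next slot to overwrite

    TailLatencyTracker(int windowSize, double percentile) {
        if (windowSize <= 0 || percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("invalid window size or percentile");
        }
        this.window = new long[windowSize];
        this.percentile = percentile;
    }

    /** Records the latency of a completed request, evicting the oldest sample. */
    synchronized void record(long latencyMillis) {
        window[next] = latencyMillis;
        next = (next + 1) % window.length;
        if (count < window.length) {
            count++;
        }
    }

    /** Nearest-rank percentile over the current window; MAX_VALUE if empty. */
    synchronized long tailLatencyMillis() {
        if (count == 0) {
            return Long.MAX_VALUE;
        }
        long[] sorted = Arrays.copyOf(window, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * count / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /**
     * Assumption: requests are only flagged once the window is full, so a few
     * early samples cannot produce a misleading threshold.
     */
    synchronized boolean isTailing(long latencyMillis) {
        return count == window.length && latencyMillis > tailLatencyMillis();
    }
}
```

A caller would invoke record() after each completed network call and check isTailing() against the elapsed time of an in-flight request to decide whether its connection is a candidate for being closed. Sorting a copy of the window on every query is simple but not cheap; a production version would likely use a streaming percentile estimate instead.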
> ABFS: [Perf] Network Profiling of Tailing Requests and Killing Bad
> Connections Proactively
> ------------------------------------------------------------------------------------------
>
> Key: HADOOP-19729
> URL: https://issues.apache.org/jira/browse/HADOOP-19729
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.4.2
> Reporter: Anuj Modi
> Assignee: Anuj Modi
> Priority: Major
> Labels: pull-request-available
>