[ https://issues.apache.org/jira/browse/IMPALA-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18070484#comment-18070484 ]
Michael Smith commented on IMPALA-13164:
----------------------------------------
Metrics from kudu::MetricRegistry look like:
{code}
[{
  "name": "rpc_connections_accepted",
  "value": 6
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_PublishFilter",
  "value": 0
}, {
  "name": "handler_latency_impala_DataStreamService_UpdateFilterFromRemote",
  "total_count": 0,
  "min": 0,
  "mean": 0.0,
  "percentile_75": 0,
  "percentile_95": 0,
  "percentile_99": 0,
  "percentile_99_9": 0,
  "percentile_99_99": 0,
  "max": 0,
  "total_sum": 0
}, {
  "name": "handler_latency_impala_DataStreamService_UpdateFilter",
  "total_count": 181,
  "min": 4,
  "mean": 442.1436464088398,
  "percentile_75": 490,
  "percentile_95": 1344,
  "percentile_99": 2624,
  "percentile_99_9": 8832,
  "percentile_99_99": 8832,
  "max": 8876,
  "total_sum": 80028
}, {
  "name": "handler_latency_impala_DataStreamService_EndDataStream",
  "total_count": 347,
  "min": 1,
  "mean": 15.187319884726226,
  "percentile_75": 8,
  "percentile_95": 16,
  "percentile_99": 532,
  "percentile_99_9": 672,
  "percentile_99_99": 672,
  "max": 675,
  "total_sum": 5270
}, {
  "name": "handler_latency_impala_DataStreamService_TransmitData",
  "total_count": 18722,
  "min": 2,
  "mean": 409.2033970729623,
  "percentile_75": 25,
  "percentile_95": 48,
  "percentile_99": 796,
  "percentile_99_9": 14592,
  "percentile_99_99": 577536,
  "max": 804053,
  "total_sum": 7661106
}, {
  "name": "reactor_active_latency_us",
  "total_count": 118484,
  "min": 0,
  "mean": 11.475102123493468,
  "percentile_75": 7,
  "percentile_95": 26,
  "percentile_99": 131,
  "percentile_99_9": 1208,
  "percentile_99_99": 4080,
  "max": 9104,
  "total_sum": 1359616
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_TransmitData",
  "value": 0
}, {
  "name": "handler_latency_impala_ControlService_ReportExecStatus",
  "total_count": 93,
  "min": 29,
  "mean": 2546.129032258065,
  "percentile_75": 3392,
  "percentile_95": 8256,
  "percentile_99": 10112,
  "percentile_99_9": 15360,
  "percentile_99_99": 15360,
  "max": 15368,
  "total_sum": 236790
}, {
  "name": "handler_latency_impala_DataStreamService_PublishFilter",
  "total_count": 70,
  "min": 98,
  "mean": 380.3428571428572,
  "percentile_75": 480,
  "percentile_95": 1080,
  "percentile_99": 1312,
  "percentile_99_9": 1984,
  "percentile_99_99": 1984,
  "max": 1988,
  "total_sum": 26624
}, {
  "name": "reactor_load_percent",
  "total_count": 16920,
  "min": 0,
  "mean": 0.1254728132387707,
  "percentile_75": 0,
  "percentile_95": 0,
  "percentile_99": 2,
  "percentile_99_9": 25,
  "percentile_99_99": 33,
  "max": 33,
  "total_sum": 2123
}, {
  "name": "queue_overflow_rejections_impala_ControlService_ExecQueryFInstances",
  "value": 0
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_UpdateFilter",
  "value": 0
}, {
  "name": "handler_latency_impala_ControlService_RemoteShutdown",
  "total_count": 0,
  "min": 0,
  "mean": 0.0,
  "percentile_75": 0,
  "percentile_95": 0,
  "percentile_99": 0,
  "percentile_99_9": 0,
  "percentile_99_99": 0,
  "max": 0,
  "total_sum": 0
}, {
  "name": "handler_latency_impala_ControlService_ExecQueryFInstances",
  "total_count": 35,
  "min": 224,
  "mean": 4184.685714285714,
  "percentile_75": 3136,
  "percentile_95": 16192,
  "percentile_99": 37632,
  "percentile_99_9": 37632,
  "percentile_99_99": 37632,
  "max": 37884,
  "total_sum": 146464
}, {
  "name": "impala_incoming_queue_time",
  "total_count": 19482,
  "min": 2,
  "mean": 50.31377681962837,
  "percentile_75": 13,
  "percentile_95": 100,
  "percentile_99": 1038,
  "percentile_99_9": 4648,
  "percentile_99_99": 7684,
  "max": 12846,
  "total_sum": 980213
}, {
  "name": "queue_overflow_rejections_impala_ControlService_ReportExecStatus",
  "value": 0
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_UpdateFilterFromRemote",
  "value": 0
}, {
  "name": "handler_latency_impala_ControlService_KillQuery",
  "total_count": 0,
  "min": 0,
  "mean": 0.0,
  "percentile_75": 0,
  "percentile_95": 0,
  "percentile_99": 0,
  "percentile_99_9": 0,
  "percentile_99_99": 0,
  "max": 0,
  "total_sum": 0
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_EndDataStream",
  "value": 0
}, {
  "name": "handler_latency_impala_ControlService_CancelQueryFInstances",
  "total_count": 34,
  "min": 26,
  "mean": 78.02941176470589,
  "percentile_75": 80,
  "percentile_95": 124,
  "percentile_99": 328,
  "percentile_99_9": 328,
  "percentile_99_99": 328,
  "max": 328,
  "total_sum": 2653
}, {
  "name": "queue_overflow_rejections_impala_ControlService_CancelQueryFInstances",
  "value": 0
}, {
  "name": "queue_overflow_rejections_impala_ControlService_RemoteShutdown",
  "value": 0
}, {
  "name": "queue_overflow_rejections_impala_ControlService_KillQuery",
  "value": 0
}]
{code}
Our existing /rpcz page captures much of this: {{rpc_connections_accepted}},
{{handler_latency_impala_<service>_<method>}}, {{rpcs_queue_overflow}} as an
aggregate of {{queue_overflow_rejections_...}}, and
{{impala_incoming_queue_time}}.
Potentially useful additions are {{reactor_active_latency_us}} and
{{reactor_load_percent}}.
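As a point of reference, here is a minimal sketch of consuming a dump in this shape. The endpoint path is a hypothetical placeholder, but the field names match the JSON above, and the aggregation mirrors how /rpcz rolls the per-method rejection counters into {{rpcs_queue_overflow}}:
{code:python}
import json
import urllib.request

# Hypothetical endpoint: assumes the kudu::MetricRegistry dump above is served
# somewhere on the embedded webserver; the path is illustrative only.
METRICS_URL = "http://localhost:25000/krpc-metrics?json"

with urllib.request.urlopen(METRICS_URL) as resp:
    metrics = json.load(resp)

by_name = {m["name"]: m for m in metrics}

# Sum the per-method queue_overflow_rejections_* counters, mirroring how
# /rpcz aggregates them into rpcs_queue_overflow.
overflow_total = sum(m["value"] for m in metrics
                     if m["name"].startswith("queue_overflow_rejections_"))
print("rpcs_queue_overflow:", overflow_total)

# The two candidate additions: reactor latency and load histograms.
for name in ("reactor_active_latency_us", "reactor_load_percent"):
    hist = by_name.get(name)
    if hist:
        print(f"{name}: mean={hist['mean']:.2f} p99={hist['percentile_99']}")
{code}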
> Add metrics for RPC reactor and deserialization threads
> -------------------------------------------------------
>
> Key: IMPALA-13164
> URL: https://issues.apache.org/jira/browse/IMPALA-13164
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend, Distributed Exec
> Affects Versions: Impala 4.4.0
> Reporter: Michael Smith
> Assignee: Michael Smith
> Priority: Critical
>
> When Impala starts to build up a queue in an RPC service - usually
> DataStreamService - that service becomes a bottleneck that slows down all
> exchanges going through that node. To diagnose what's happening, we generally
> need to collect pstacks to see why the queue is backed up.
> If all KrpcDataStreamMgr deserialization threads are in use, it can help to
> increase the number of threads available via
> {{datastream_service_num_deserialization_threads}}. This is the most common
> bottleneck on larger machines; DataStream packets can be large and take time
> to deserialize.
> Less frequently, the reactor threads in RpcMgr that send requests and receive
> responses (handling the actual network I/O) may all be busy; this is rarer
> because {{num_reactor_threads}} defaults to the number of CPU cores.
> ControlService uses a single thread pool for both network I/O and
> deserialization because its packets tend to be much smaller; this pool is
> sized by {{control_service_num_svc_threads}}, which also defaults to the
> number of CPU cores.
> It would help to have metrics for the RPC reactor and deserialization threads
> that tell us when they're fully loaded and building up a queue; this could
> identify when to increase {{datastream_service_num_deserialization_threads}}
> or {{num_reactor_threads}}.
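If such metrics were surfaced, a first-pass saturation check could lean on histograms already present in the dump above. The sketch below maps two of them to the flags mentioned in the description; the thresholds are illustrative assumptions, and {{by_name}} is reused from the earlier sketch:
{code:python}
# Sketch of a saturation check over the same JSON shape as above. Thresholds
# are illustrative assumptions, not recommendations; dedicated metrics for the
# reactor and deserialization pools don't exist yet, which is the gap this
# issue describes.

def suggest_tuning(by_name):
    hints = []
    # Long waits in the incoming queue suggest too few deserialization threads.
    q = by_name.get("impala_incoming_queue_time", {})
    if q.get("percentile_99", 0) > 1000:  # illustrative threshold
        hints.append("raise --datastream_service_num_deserialization_threads")
    # Sustained reactor load near 100% means the reactor pool is saturated.
    r = by_name.get("reactor_load_percent", {})
    if r.get("percentile_99", 0) > 90:
        hints.append("raise --num_reactor_threads")
    return hints

print(suggest_tuning(by_name))  # by_name built in the earlier sketch
{code}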