[ https://issues.apache.org/jira/browse/IMPALA-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18070484#comment-18070484 ]
Michael Smith commented on IMPALA-13164:
----------------------------------------
Metrics from kudu::MetricRegistry look like:
{code}
[{
  "name": "rpc_connections_accepted",
  "value": 6
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_PublishFilter",
  "value": 0
}, {
  "name": "handler_latency_impala_DataStreamService_UpdateFilterFromRemote",
  "total_count": 0,
  "min": 0,
  "mean": 0.0,
  "percentile_75": 0,
  "percentile_95": 0,
  "percentile_99": 0,
  "percentile_99_9": 0,
  "percentile_99_99": 0,
  "max": 0,
  "total_sum": 0
}, {
  "name": "handler_latency_impala_DataStreamService_UpdateFilter",
  "total_count": 181,
  "min": 4,
  "mean": 442.1436464088398,
  "percentile_75": 490,
  "percentile_95": 1344,
  "percentile_99": 2624,
  "percentile_99_9": 8832,
  "percentile_99_99": 8832,
  "max": 8876,
  "total_sum": 80028
}, {
  "name": "handler_latency_impala_DataStreamService_EndDataStream",
  "total_count": 347,
  "min": 1,
  "mean": 15.187319884726226,
  "percentile_75": 8,
  "percentile_95": 16,
  "percentile_99": 532,
  "percentile_99_9": 672,
  "percentile_99_99": 672,
  "max": 675,
  "total_sum": 5270
}, {
  "name": "handler_latency_impala_DataStreamService_TransmitData",
  "total_count": 18722,
  "min": 2,
  "mean": 409.2033970729623,
  "percentile_75": 25,
  "percentile_95": 48,
  "percentile_99": 796,
  "percentile_99_9": 14592,
  "percentile_99_99": 577536,
  "max": 804053,
  "total_sum": 7661106
}, {
  "name": "reactor_active_latency_us",
  "total_count": 118484,
  "min": 0,
  "mean": 11.475102123493468,
  "percentile_75": 7,
  "percentile_95": 26,
  "percentile_99": 131,
  "percentile_99_9": 1208,
  "percentile_99_99": 4080,
  "max": 9104,
  "total_sum": 1359616
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_TransmitData",
  "value": 0
}, {
  "name": "handler_latency_impala_ControlService_ReportExecStatus",
  "total_count": 93,
  "min": 29,
  "mean": 2546.129032258065,
  "percentile_75": 3392,
  "percentile_95": 8256,
  "percentile_99": 10112,
  "percentile_99_9": 15360,
  "percentile_99_99": 15360,
  "max": 15368,
  "total_sum": 236790
}, {
  "name": "handler_latency_impala_DataStreamService_PublishFilter",
  "total_count": 70,
  "min": 98,
  "mean": 380.3428571428572,
  "percentile_75": 480,
  "percentile_95": 1080,
  "percentile_99": 1312,
  "percentile_99_9": 1984,
  "percentile_99_99": 1984,
  "max": 1988,
  "total_sum": 26624
}, {
  "name": "reactor_load_percent",
  "total_count": 16920,
  "min": 0,
  "mean": 0.1254728132387707,
  "percentile_75": 0,
  "percentile_95": 0,
  "percentile_99": 2,
  "percentile_99_9": 25,
  "percentile_99_99": 33,
  "max": 33,
  "total_sum": 2123
}, {
  "name": "queue_overflow_rejections_impala_ControlService_ExecQueryFInstances",
  "value": 0
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_UpdateFilter",
  "value": 0
}, {
  "name": "handler_latency_impala_ControlService_RemoteShutdown",
  "total_count": 0,
  "min": 0,
  "mean": 0.0,
  "percentile_75": 0,
  "percentile_95": 0,
  "percentile_99": 0,
  "percentile_99_9": 0,
  "percentile_99_99": 0,
  "max": 0,
  "total_sum": 0
}, {
  "name": "handler_latency_impala_ControlService_ExecQueryFInstances",
  "total_count": 35,
  "min": 224,
  "mean": 4184.685714285714,
  "percentile_75": 3136,
  "percentile_95": 16192,
  "percentile_99": 37632,
  "percentile_99_9": 37632,
  "percentile_99_99": 37632,
  "max": 37884,
  "total_sum": 146464
}, {
  "name": "impala_incoming_queue_time",
  "total_count": 19482,
  "min": 2,
  "mean": 50.31377681962837,
  "percentile_75": 13,
  "percentile_95": 100,
  "percentile_99": 1038,
  "percentile_99_9": 4648,
  "percentile_99_99": 7684,
  "max": 12846,
  "total_sum": 980213
}, {
  "name": "queue_overflow_rejections_impala_ControlService_ReportExecStatus",
  "value": 0
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_UpdateFilterFromRemote",
  "value": 0
}, {
  "name": "handler_latency_impala_ControlService_KillQuery",
  "total_count": 0,
  "min": 0,
  "mean": 0.0,
  "percentile_75": 0,
  "percentile_95": 0,
  "percentile_99": 0,
  "percentile_99_9": 0,
  "percentile_99_99": 0,
  "max": 0,
  "total_sum": 0
}, {
  "name": "queue_overflow_rejections_impala_DataStreamService_EndDataStream",
  "value": 0
}, {
  "name": "handler_latency_impala_ControlService_CancelQueryFInstances",
  "total_count": 34,
  "min": 26,
  "mean": 78.02941176470589,
  "percentile_75": 80,
  "percentile_95": 124,
  "percentile_99": 328,
  "percentile_99_9": 328,
  "percentile_99_99": 328,
  "max": 328,
  "total_sum": 2653
}, {
  "name": "queue_overflow_rejections_impala_ControlService_CancelQueryFInstances",
  "value": 0
}, {
  "name": "queue_overflow_rejections_impala_ControlService_RemoteShutdown",
  "value": 0
}, {
  "name": "queue_overflow_rejections_impala_ControlService_KillQuery",
  "value": 0
}]
{code}
Our existing /rpcz page captures much of this: {{rpc_connections_accepted}},
{{handler_latency_impala_<service>_<method>}}, {{rpcs_queue_overflow}} as an
aggregate of {{queue_overflow_rejections_...}}, and
{{impala_incoming_queue_time}}.
Potentially useful additions are {{reactor_active_latency_us}} and
{{reactor_load_percent}}.
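As a point of reference, here is a minimal sketch of consuming a dump in this shape. The endpoint path is a hypothetical placeholder, but the field names match the JSON above, and the aggregation mirrors how /rpcz rolls the per-method rejection counters into {{rpcs_queue_overflow}}:
{code:python}
import json
import urllib.request

# Hypothetical endpoint: assumes the kudu::MetricRegistry dump above is served
# somewhere on the embedded webserver; the path is illustrative only.
METRICS_URL = "http://localhost:25000/krpc-metrics?json"

with urllib.request.urlopen(METRICS_URL) as resp:
    metrics = json.load(resp)

by_name = {m["name"]: m for m in metrics}

# Sum the per-method queue_overflow_rejections_* counters, mirroring how
# /rpcz aggregates them into rpcs_queue_overflow.
overflow_total = sum(m["value"] for m in metrics
                     if m["name"].startswith("queue_overflow_rejections_"))
print("rpcs_queue_overflow:", overflow_total)

# The two candidate additions: reactor latency and load histograms.
for name in ("reactor_active_latency_us", "reactor_load_percent"):
    hist = by_name.get(name)
    if hist:
        print(f"{name}: mean={hist['mean']:.2f} p99={hist['percentile_99']}")
{code}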
> Add metrics for RPC reactor and deserialization threads
> -------------------------------------------------------
>
> Key: IMPALA-13164
> URL: https://issues.apache.org/jira/browse/IMPALA-13164
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend, Distributed Exec
> Affects Versions: Impala 4.4.0
> Reporter: Michael Smith
> Assignee: Michael Smith
> Priority: Critical
>
> When Impala starts to build up a queue in an RPC service - usually
> DataStreamService - that service becomes a bottleneck that slows down all
> exchanges going through that node. To diagnose what's happening, we generally
> need to collect pstacks to see why the queue is backed up.
> If all KrpcDataStreamMgr deserialization threads are in use, it can help to
> increase the number of threads available via
> {{datastream_service_num_deserialization_threads}}. This is the most common
> bottleneck on larger machines; DataStream packets can be large and take time
> to deserialize.
> Less frequently, the reactor threads in RpcMgr that send requests and receive
> responses (handling the actual network I/O) may all be busy; this is rarer
> because {{num_reactor_threads}} defaults to the number of CPU cores.
> ControlService uses a single thread pool for both network I/O and
> deserialization because its packets tend to be much smaller; this pool is
> sized by {{control_service_num_svc_threads}}, which also defaults to the
> number of CPU cores.
> It would help to have metrics for the RPC reactor and deserialization threads
> that tell us when they're fully loaded and building up a queue; this could
> identify when to increase {{datastream_service_num_deserialization_threads}}
> or {{num_reactor_threads}}.
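If such metrics were surfaced, a first-pass saturation check could lean on histograms already present in the dump above. The sketch below maps two of them to the flags mentioned in the description; the thresholds are illustrative assumptions, and {{by_name}} is reused from the earlier sketch:
{code:python}
# Sketch of a saturation check over the same JSON shape as above. Thresholds
# are illustrative assumptions, not recommendations; dedicated metrics for the
# reactor and deserialization pools don't exist yet, which is the gap this
# issue describes.

def suggest_tuning(by_name):
    hints = []
    # Long waits in the incoming queue suggest too few deserialization threads.
    q = by_name.get("impala_incoming_queue_time", {})
    if q.get("percentile_99", 0) > 1000:  # illustrative threshold
        hints.append("raise --datastream_service_num_deserialization_threads")
    # Sustained reactor load near 100% means the reactor pool is saturated.
    r = by_name.get("reactor_load_percent", {})
    if r.get("percentile_99", 0) > 90:
        hints.append("raise --num_reactor_threads")
    return hints

print(suggest_tuning(by_name))  # by_name built in the earlier sketch
{code}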