[ 
https://issues.apache.org/jira/browse/HADOOP-17893?focusedWorklogId=649845&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-649845
 ]

ASF GitHub Bot logged work on HADOOP-17893:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Sep/21 06:40
            Start Date: 13/Sep/21 06:40
    Worklog Time Spent: 10m 
      Work Description: Neilxzn opened a new pull request #3426:
URL: https://github.com/apache/hadoop/pull/3426


   
   ### Description of PR
   https://issues.apache.org/jira/browse/HADOOP-17893
   
   ### How was this patch tested?
   Added a new test, testTopMetricsPublish.
   
   ### For code changes:
   Parse TopMetrics in PrometheusMetricsSink
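
A minimal sketch of what parsing a TopMetrics name could look like, based only on the before/after metric names given in the linked issue. The class `TopMetricNames`, the method `normalize`, and the regular expression are hypothetical illustrations, not the code from the pull request; the sketch assumes the per-user and per-op parts of the name can be dropped because they are already carried as the `op` and `user` tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not the actual PrometheusMetricsSink code. */
class TopMetricNames {
  // Assumed shape: a window prefix, then per-op/per-user text, then "_count".
  private static final Pattern TOP_METRIC =
      Pattern.compile("^(nn_top_user_op_counts_window_ms_\\d+)_op_.*_(count)$");

  /** Collapses a per-user, per-op TopMetrics name into one name per window. */
  static String normalize(String name) {
    Matcher m = TOP_METRIC.matcher(name);
    // Names that do not look like TopMetrics are returned unchanged.
    return m.matches() ? m.group(1) + "_" + m.group(2) : name;
  }
}
```

With this sketch, the long per-user name from the issue maps to `nn_top_user_op_counts_window_ms_1500000_count`, so the number of distinct metric names no longer grows with the number of users and operations.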
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

            Worklog Id:     (was: 649845)
    Remaining Estimate: 0h
            Time Spent: 10m

> Improve PrometheusSink for Namenode and ResourceManager Metrics
> ---------------------------------------------------------------
>
>                 Key: HADOOP-17893
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17893
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 3.4.0
>            Reporter: Max  Xie
>            Assignee: Max  Xie
>            Priority: Minor
>         Attachments: HADOOP-17893.01.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> HADOOP-16398 added an exporter that publishes Hadoop metrics to Prometheus, 
> but some metrics are not exported correctly. For example: 
> 1. Queue metrics for ResourceManager
> {code:java}
> queue_metrics_max_capacity{queue="root.queue1",context="yarn",hostname="rm_host1"} 1
> // the metric for queue2 is not exported
> queue_metrics_max_capacity{queue="root.queue2",context="yarn",hostname="rm_host1"} 2
> {code}
> Only one queue's metric is ever exported, because 
> PrometheusMetricsSink$metricLines caches a single entry per metric name, 
> even when metrics with the same name carry different tags.
>  
> 2. RPC metrics for Namenode
> A Namenode may expose RPC metrics on multiple ports, for example the 
> service RPC port. For the same reason as in issue 1, some of these RPC 
> metrics are lost when PrometheusSink is used.
> {code:java}
> rpc_rpc_queue_time300s90th_percentile_latency{port="9000",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
> // the metric for RPC port 9005 is not exported
> rpc_rpc_queue_time300s90th_percentile_latency{port="9005",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
> {code}
> 3. TopMetrics for Namenode
> org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics is a special 
> case; I think it is essentially a Summary metric type. The TopMetrics 
> record name varies with the user and op, so these metrics accumulate in 
> PrometheusMetricsSink$metricLines and are never removed, which risks a 
> memory leak. We need to treat it specially. 
> {code:java}
> // invalid TopMetrics export
> # TYPE nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count counter
> nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/[email protected]"} 10
> // it should be 
> # TYPE nn_top_user_op_counts_window_ms_1500000_count counter
> nn_top_user_op_counts_window_ms_1500000_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/[email protected]"} 10
> {code}
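
Issues 1 and 2 share one cause: the cache is keyed by metric name alone. One way to avoid that, sketched here under assumptions, is to key the cache by metric name and then by the tag set. The class `TaggedMetricCache` and its `put` and `render` methods are hypothetical and are not the actual PrometheusMetricsSink code; the `gauge` type is assumed for illustration, and label-value escaping is omitted for brevity.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Hypothetical cache; not the actual PrometheusMetricsSink code. */
class TaggedMetricCache {
  // metric name -> (rendered, sorted label set -> latest value)
  private final Map<String, Map<String, Number>> cache = new ConcurrentHashMap<>();

  /** Stores a sample; samples sharing a name but differing in tags stay separate. */
  void put(String name, Map<String, String> tags, Number value) {
    // Sort the tags so the same tag set always produces the same key.
    String labels = new TreeMap<>(tags).entrySet().stream()
        .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
        .collect(Collectors.joining(",", "{", "}"));
    cache.computeIfAbsent(name, k -> new ConcurrentHashMap<>()).put(labels, value);
  }

  /** Renders every cached sample in the Prometheus text exposition format. */
  String render() {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Map<String, Number>> metric : new TreeMap<>(cache).entrySet()) {
      // One TYPE line per metric name, followed by one line per tag set.
      sb.append("# TYPE ").append(metric.getKey()).append(" gauge\n");
      for (Map.Entry<String, Number> sample : new TreeMap<>(metric.getValue()).entrySet()) {
        sb.append(metric.getKey()).append(sample.getKey())
            .append(' ').append(sample.getValue()).append('\n');
      }
    }
    return sb.toString();
  }
}
```

Putting `queue_metrics_max_capacity` twice, once with `queue="root.queue1"` and once with `queue="root.queue2"`, leaves both samples in the rendered output, while a second put with an identical tag set only overwrites the earlier value.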



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
