[jira] [Updated] (SOLR-15059) Default Grafana dashboard needs to expose graphs for monitoring query performance

Timothy Potter (Jira) Thu, 07 Jan 2021 09:10:18 -0800


     [ 
https://issues.apache.org/jira/browse/SOLR-15059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Timothy Potter updated SOLR-15059:
----------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Default Grafana dashboard needs to expose graphs for monitoring query 
> performance
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-15059
>                 URL: https://issues.apache.org/jira/browse/SOLR-15059
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Grafana Dashboard, metrics
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>            Priority: Major
>             Fix For: 8.8, master (9.0)
>
>         Attachments: Screen Shot 2020-12-23 at 10.22.43 AM.png
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The default Grafana dashboard doesn't expose graphs for monitoring query 
> performance. For instance, if I want to see QPS for a collection, that's not 
> shown in the default dashboard. Same for quantiles like p95 query latency.
> After some digging, I found that these metrics are available in the output of 
> {{/admin/metrics}} but are not exported by the exporter.
> This PR proposes to enhance the default dashboard with a new Query Metrics 
> section with the following metrics:
> * Distributed QPS per Collection (aggregated across all cores)
> * Distributed QPS per Solr Node (aggregated across all base_url)
> * QPS 1-min rate per core
> * QPS 5-min rate per core
> * Top-level Query latency p99, p95, p75
> * Local (non-distrib) query count per core (this is important for determining 
> if there is unbalanced load)
> * Local (non-distrib) query rate per core (1-min)
> * Local (non-distrib) p95 per core
> Also, the {{solr-exporter-config.xml}} uses {{jq}} queries to pull metrics 
> from the output of {{/admin/metrics}}. This file is huge and contains a 
> bunch of {{jq}} boilerplate. Moreover, I'm introducing another 15-20 metrics 
> in this PR, which only makes the file more verbose.
> Thus, I'm also introducing support for jq templates to reduce 
> boilerplate, reduce syntax errors, and improve readability. For instance, the 
> query metrics I'm adding to the config look like this:
> {code}
>           <str>
>             $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
>           </str>
>           <str>
>             $jq:core-query(5minRate, endswith(".distrib.requestTimes"))
>           </str>
> {code}
> This avoids duplicating the complicated {{jq}} query for each metric. The 
> templates are optional and should only be used if a given jq structure is 
> repeated 3 or more times. Otherwise, inlining the jq query is still 
> supported. Here's how the templates work:
> {code}
>   A regex with named groups is used to match template references to template vars using the basic pattern:
>       $jq:<TEMPLATE>( <UNIQUE>, <KEYSELECTOR>, <METRIC>, <TYPE> )
>   For instance,
>       $jq:core(requests_total, endswith(".requestTimes"), count, COUNTER)
>   TEMPLATE = core
>   UNIQUE = requests_total (unique suffix for this metric, results in a metric named "solr_metrics_core_requests_total")
>   KEYSELECTOR = endswith(".requestTimes") (filter to select the specific key for this metric)
>   METRIC = count
>   TYPE = COUNTER
>   Some templates may have a default type, so you can omit that from your template reference, such as:
>       $jq:core(requests_total, endswith(".requestTimes"), count)
>   This uses defaultType=COUNTER, as many uses of the core template are counts.
>   If a template reference omits the metric, then the unique suffix is used, for instance:
>       $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
>   This creates a GAUGE metric (default type) named "solr_metrics_core_query_1minRate" using the 1minRate value from the selected JSON object.
> {code}
> So that people don't have to go digging through the large diff on the config XML, 
> here are the query metrics I'm adding to the exporter config using the 
> templates idea:
> {code}
>           <str>
>             $jq:core-query(errors_1minRate, select(.key | endswith(".errors")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(client_errors_1minRate, select(.key | endswith(".clientErrors")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(1minRate, select(.key | endswith(".distrib.requestTimes")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(5minRate, select(.key | endswith(".distrib.requestTimes")), 5minRate)
>           </str>
>           <str>
>             $jq:core-query(median_ms, select(.key | endswith(".distrib.requestTimes")), median_ms)
>           </str>
>           <str>
>             $jq:core-query(p75_ms, select(.key | endswith(".distrib.requestTimes")), p75_ms)
>           </str>
>           <str>
>             $jq:core-query(p95_ms, select(.key | endswith(".distrib.requestTimes")), p95_ms)
>           </str>
>           <str>
>             $jq:core-query(p99_ms, select(.key | endswith(".distrib.requestTimes")), p99_ms)
>           </str>
>           <str>
>             $jq:core-query(mean_rate, select(.key | endswith(".distrib.requestTimes")), meanRate)
>           </str>
>
>           <!-- Local (non-distrib) query metrics -->
>           <str>
>             $jq:core-query(local_1minRate, select(.key | endswith(".local.requestTimes")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(local_5minRate, select(.key | endswith(".local.requestTimes")), 5minRate)
>           </str>
>           <str>
>             $jq:core-query(local_median_ms, select(.key | endswith(".local.requestTimes")), median_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p75_ms, select(.key | endswith(".local.requestTimes")), p75_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p95_ms, select(.key | endswith(".local.requestTimes")), p95_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p99_ms, select(.key | endswith(".local.requestTimes")), p99_ms)
>           </str>
>           <str>
>             $jq:core-query(local_mean_rate, select(.key | endswith(".local.requestTimes")), meanRate)
>           </str>
>           <str>
>             $jq:core-query(local_count, select(.key | endswith(".local.requestTimes")), count, COUNTER)
>           </str>
> {code}
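
For readers who want to see how a template reference of the form $jq:<TEMPLATE>( <UNIQUE>, <KEYSELECTOR>, <METRIC>, <TYPE> ) could be resolved, here is a minimal Python sketch. It is not the exporter's code: the issue does not show the actual regex, so the pattern, the MetricRef type, and parse_template_ref are hypothetical names made up for illustration. The default types (COUNTER for core, GAUGE for core-query) and the metric naming are taken from the examples in the description; the rule of replacing "-" with "_" in the template name is an assumption inferred from those two examples.

```python
import re
from typing import NamedTuple

# Hypothetical pattern: the issue only says "a regex with named groups",
# so this is an illustration, not the regex used by the exporter.
TEMPLATE_REF = re.compile(
    r"""
    \$jq:(?P<template>[\w-]+)\(\s*       # template name and opening parenthesis
    (?P<unique>[^,\s]+)\s*,\s*           # unique suffix for the metric name
    (?P<key_selector>.+?)\s*             # key selector, may contain parentheses
    (?:,\s*(?P<metric>[^,()\s]+)\s*)?    # optional metric field
    (?:,\s*(?P<type>[^,()\s]+)\s*)?      # optional metric type
    \)
    """,
    re.VERBOSE | re.DOTALL,
)

# Default types as stated in the description for these two templates.
DEFAULT_TYPES = {"core": "COUNTER", "core-query": "GAUGE"}


class MetricRef(NamedTuple):
    name: str
    key_selector: str
    metric: str
    metric_type: str


def parse_template_ref(ref: str) -> MetricRef:
    """Resolve a template reference, filling in the documented defaults."""
    m = TEMPLATE_REF.fullmatch(ref.strip())
    if m is None:
        raise ValueError(f"not a template reference: {ref!r}")
    template, unique = m["template"], m["unique"]
    metric_type = m["type"] or DEFAULT_TYPES.get(template)
    if metric_type is None:
        raise ValueError(f"no type given and no default for template {template!r}")
    return MetricRef(
        # Assumed naming rule: solr_metrics_<template with '-' as '_'>_<unique>
        name=f"solr_metrics_{template.replace('-', '_')}_{unique}",
        key_selector=m["key_selector"],
        # If the metric is omitted, the unique suffix doubles as the metric field.
        metric=m["metric"] or unique,
        metric_type=metric_type,
    )
```

With this sketch, parse_template_ref('$jq:core-query(1minRate, endswith(".distrib.requestTimes"))') resolves to a GAUGE named solr_metrics_core_query_1minRate that reads the 1minRate field, matching the behavior described for omitted metric and type. A selector containing commas would defeat this simple pattern; none of the references in the description do.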



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
