[ https://issues.apache.org/jira/browse/SOLR-15059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timothy Potter updated SOLR-15059:
----------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Default Grafana dashboard needs to expose graphs for monitoring query
> performance
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-15059
>                 URL: https://issues.apache.org/jira/browse/SOLR-15059
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: Grafana Dashboard, metrics
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>            Priority: Major
>             Fix For: 8.8, master (9.0)
>
>         Attachments: Screen Shot 2020-12-23 at 10.22.43 AM.png
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The default Grafana dashboard doesn't expose graphs for monitoring query
> performance. For instance, if I want to see QPS for a collection, that's not
> shown in the default dashboard. The same goes for quantiles like p95 query
> latency. After some digging, I found that these metrics are available in the
> output from {{/admin/metrics}} but are not exported by the exporter.
> This PR proposes to enhance the default dashboard with a new Query Metrics
> section with the following metrics:
> * Distributed QPS per Collection (aggregated across all cores)
> * Distributed QPS per Solr Node (aggregated across all base_url)
> * QPS 1-min rate per core
> * QPS 5-min rate per core
> * Top-level Query latency p99, p95, p75
> * Local (non-distrib) query count per core (this is important for determining
> if there is unbalanced load)
> * Local (non-distrib) query rate per core (1-min)
> * Local (non-distrib) p95 per core
> Also, the {{solr-exporter-config.xml}} uses {{jq}} queries to pull metrics
> from the output of {{/admin/metrics}}. This file is huge and contains a
> bunch of {{jq}} boilerplate. Moreover, I'm introducing another 15-20 metrics
> in this PR, which only makes the file more verbose.
> Thus, I'm also introducing support for jq templates so as to reduce
> boilerplate, reduce syntax errors, and improve readability. For instance, the
> query metrics I'm adding to the config look like this:
> {code}
>           <str>
>             $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
>           </str>
>           <str>
>             $jq:core-query(5minRate, endswith(".distrib.requestTimes"))
>           </str>
> {code}
> This avoids duplicating the complicated {{jq}} query for each metric. The
> templates are optional and should only be used if a given jq structure is
> repeated 3 or more times. Otherwise, inlining the jq query is still
> supported. Here's how the templates work:
> {code}
>   A regex with named groups is used to match template references to
>   template + vars using the basic pattern:
>        $jq:<TEMPLATE>( <UNIQUE>, <KEYSELECTOR>, <METRIC>, <TYPE> )
>   For instance,
>        $jq:core(requests_total, endswith(".requestTimes"), count, COUNTER)
>   TEMPLATE = core
>   UNIQUE = requests_total (unique suffix for this metric, results in a
>   metric named "solr_metrics_core_requests_total")
>   KEYSELECTOR = endswith(".requestTimes") (filter to select the specific
>   key for this metric)
>   METRIC = count
>   TYPE = COUNTER
>   Some templates may have a default type, so you can omit that from your
>   template reference, such as:
>        $jq:core(requests_total, endswith(".requestTimes"), count)
>   Uses the defaultType=COUNTER as many uses of the core template are counts.
>   If a template reference omits the metric, then the unique suffix is used,
>   for instance:
>        $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
>   Creates a GAUGE metric (default type) named
>   "solr_metrics_core_query_1minRate" using the 1minRate value from the
>   selected JSON object.
> {code}
> Just so people don't have to go digging in the large diff on the config XML,
> here are the query metrics I'm adding to the exporter config with use of the
> templates idea:
> {code}
>           <str>
>             $jq:core-query(errors_1minRate, select(.key | endswith(".errors")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(client_errors_1minRate, select(.key | endswith(".clientErrors")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(1minRate, select(.key | endswith(".distrib.requestTimes")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(5minRate, select(.key | endswith(".distrib.requestTimes")), 5minRate)
>           </str>
>           <str>
>             $jq:core-query(median_ms, select(.key | endswith(".distrib.requestTimes")), median_ms)
>           </str>
>           <str>
>             $jq:core-query(p75_ms, select(.key | endswith(".distrib.requestTimes")), p75_ms)
>           </str>
>           <str>
>             $jq:core-query(p95_ms, select(.key | endswith(".distrib.requestTimes")), p95_ms)
>           </str>
>           <str>
>             $jq:core-query(p99_ms, select(.key | endswith(".distrib.requestTimes")), p99_ms)
>           </str>
>           <str>
>             $jq:core-query(mean_rate, select(.key | endswith(".distrib.requestTimes")), meanRate)
>           </str>
>
>           <!-- Local (non-distrib) query metrics -->
>           <str>
>             $jq:core-query(local_1minRate, select(.key | endswith(".local.requestTimes")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(local_5minRate, select(.key | endswith(".local.requestTimes")), 5minRate)
>           </str>
>           <str>
>             $jq:core-query(local_median_ms, select(.key | endswith(".local.requestTimes")), median_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p75_ms, select(.key | endswith(".local.requestTimes")), p75_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p95_ms, select(.key | endswith(".local.requestTimes")), p95_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p99_ms, select(.key | endswith(".local.requestTimes")), p99_ms)
>           </str>
>           <str>
>             $jq:core-query(local_mean_rate, select(.key | endswith(".local.requestTimes")), meanRate)
>           </str>
>           <str>
>             $jq:core-query(local_count, select(.key | endswith(".local.requestTimes")), count, COUNTER)
>           </str>
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
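
For readers who want to experiment with the template-reference format described in the issue, here is a minimal Python sketch of how a reference of the form $jq:<TEMPLATE>( <UNIQUE>, <KEYSELECTOR>, <METRIC>, <TYPE> ) might be parsed with a regex that uses named groups. It is an illustration only, not the exporter's actual code. The regex itself, the parse_template_ref helper, the DEFAULT_TYPES table, the GAUGE fallback for unknown templates, and the rule for building the metric name are all assumptions inferred from the examples in the description. The sketch also assumes that a key selector never contains a comma.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern with named groups; the regex in the real exporter may differ.
# Assumes the key selector contains no commas (true for every example in the issue).
TEMPLATE_REF = re.compile(
    r"\$jq:(?P<template>[\w-]+)\(\s*"
    r"(?P<unique>[^,]+?)\s*,\s*"
    r"(?P<keyselector>[^,]+?)\s*"
    r"(?:,\s*(?P<metric>[^,\s]+)\s*)?"   # optional METRIC
    r"(?:,\s*(?P<type>[^,\s]+)\s*)?"     # optional TYPE
    r"\)\s*"
)

# Default types taken from the description: "core" defaults to COUNTER,
# "core-query" to GAUGE. Any other template name here would be a guess.
DEFAULT_TYPES = {"core": "COUNTER", "core-query": "GAUGE"}


class TemplateRef(NamedTuple):
    template: str
    unique: str
    keyselector: str
    metric: str
    metric_type: str
    name: str


def parse_template_ref(text: str) -> Optional[TemplateRef]:
    """Parse one template reference, or return None if the text is not one."""
    m = TEMPLATE_REF.fullmatch(text.strip())
    if m is None:
        return None
    template, unique = m["template"], m["unique"]
    # If the metric is omitted, the unique suffix doubles as the metric key.
    metric = m["metric"] or unique
    # If the type is omitted, fall back to the template's default type
    # (GAUGE for unknown templates is an assumption of this sketch).
    metric_type = m["type"] or DEFAULT_TYPES.get(template, "GAUGE")
    # Naming rule inferred from the two names quoted in the description.
    name = f"solr_metrics_{template.replace('-', '_')}_{unique}"
    return TemplateRef(template, unique, m["keyselector"], metric, metric_type, name)


if __name__ == "__main__":
    ref = parse_template_ref(
        '$jq:core-query(1minRate, endswith(".distrib.requestTimes"))'
    )
    print(ref)
```

Under these assumptions, the two-argument reference resolves to a GAUGE metric named solr_metrics_core_query_1minRate, and the four-argument core example resolves to a COUNTER named solr_metrics_core_requests_total, matching the names quoted in the description.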