This is an automated email from the ASF dual-hosted git repository. xxyu pushed a commit to branch doc5.0 in repository https://gitbox.apache.org/repos/asf/kylin.git
commit e840d1fb2af5a143f8b43514a8fa699227da43da Author: Mukvin <boyboys...@163.com> AuthorDate: Fri Aug 12 18:04:29 2022 +0800 KYLIN-5221, add monitoring in operations --- .../operations/monitoring/images/dashboard.jpg | Bin 0 -> 235619 bytes .../docs/operations/monitoring/images/interval.png | Bin 0 -> 162846 bytes .../operations/monitoring/influxdb/influxdb.md | 211 +++++++++++++++++++++ .../monitoring/influxdb/influxdb_maintenance.md | 124 ++++++++++++ .../docs/operations/monitoring/influxdb/intro.md | 22 +++ website/docs/operations/monitoring/intro.md | 24 +++ .../docs/operations/monitoring/metrics_intro.md | 209 ++++++++++++++++++++ website/docs/operations/monitoring/service.md | 181 ++++++++++++++++++ 8 files changed, 771 insertions(+) diff --git a/website/docs/operations/monitoring/images/dashboard.jpg b/website/docs/operations/monitoring/images/dashboard.jpg new file mode 100644 index 0000000000..9d893aaa28 Binary files /dev/null and b/website/docs/operations/monitoring/images/dashboard.jpg differ diff --git a/website/docs/operations/monitoring/images/interval.png b/website/docs/operations/monitoring/images/interval.png new file mode 100644 index 0000000000..f89954398b Binary files /dev/null and b/website/docs/operations/monitoring/images/interval.png differ diff --git a/website/docs/operations/monitoring/influxdb/influxdb.md b/website/docs/operations/monitoring/influxdb/influxdb.md new file mode 100644 index 0000000000..d0636f0f64 --- /dev/null +++ b/website/docs/operations/monitoring/influxdb/influxdb.md @@ -0,0 +1,211 @@ +--- +title: Use InfluxDB as Time-Series Database +language: en +sidebar_label: Use InfluxDB as Time-Series Database +pagination_label: Use InfluxDB as Time-Series Database +toc_min_heading_level: 2 +toc_max_heading_level: 6 +pagination_prev: null +pagination_next: null +keywords: + - influxdb +draft: false +last_update: + date: 12/08/2022 +--- + + +### <span id="preparation">Preparation</span> + +Starting with Kylin 5.0, the system uses 
RDBMS to store query history and uses InfluxDB only to record the monitoring information of the system. + +If you need this information, configure the time series database InfluxDB in advance so that it can store the monitoring data. + +We recommend InfluxDB v1.6.4, which is provided in the Kylin installation package. + +The InfluxDB installation package, `influxdb-1.6.4.x86_64.rpm`, is under the `influxdb` directory in the installation directory of Kylin. + +If you need to use an existing InfluxDB database in your environment, please use one of the versions below: + +- InfluxDB 1.6.4 or above + +You can use the following command to check the version of InfluxDB in your current environment. + +```shell +service influxdb version +``` + +### <span id="root">Installation and Configuration for `root` User</span> + +The following steps use InfluxDB 1.6.4 as an example. + +1. Run the following command to check whether InfluxDB is already installed. + + ```shell + service influxdb status + ``` + + If not, go to the directory where the InfluxDB installation package is located and install InfluxDB. + + ```shell + cd $KYLIN_HOME/influxdb + rpm -ivh influxdb-1.6.4.x86_64.rpm + ``` + +2. Launch InfluxDB. + + ```sh + service influxdb start + ``` + + By default, you can find InfluxDB's log at `/var/log/influxdb`. + +3. If your InfluxDB server port is in use, you can modify the InfluxDB configuration file to change the server port. + + ```sh + vi /etc/influxdb/influxdb.conf + ``` + + Please note the following three points: + + - Modify RPC port: The initial property is `bind-address = "127.0.0.1:8088"`. You can change `8088` to an available port, for instance, `18087`. + - Modify HTTP port: The initial property is `bind-address = ":8086"`. You can change `8086` to an available port, for instance, `18086`.
+ - Set `reporting-disabled = true`, which means that InfluxDB will not send reports to [influxdata.com](https://www.influxdata.com/) regularly. + +4. InfluxDB is accessible without a user name and password by default. If you want to strengthen the security level, you can set a password with the following steps: + + 1. Log in to InfluxDB. + + ```sh + influx -port 18086 + ``` + + **Tips:** Please replace `18086` with the port number you actually configured. + + 2. Create an admin user and set its password. + + ```mariadb + CREATE USER admin WITH PASSWORD 'admin' WITH ALL PRIVILEGES + ``` + + 3. Open the configuration file and set `auth-enabled = true` in the `[http]` section to enable authentication. + + ```sh + vi /etc/influxdb/influxdb.conf + ``` + + 4. Restart InfluxDB for the change to take effect, then log in to InfluxDB again. + + ```sh + service influxdb restart + influx -port 18086 -username admin -password admin + ``` + +5. Open the property file `kylin.properties` and modify the InfluxDB configurations. Please replace `ip:http_port`, `user`, `pwd`, `ip:rpc_port` with real values. + + ```properties + ### Open the file with: vi $KYLIN_HOME/conf/kylin.properties + + ### Modify the following properties + + kylin.influxdb.address=ip:http_port + kylin.influxdb.username=user + kylin.influxdb.password=pwd + kylin.metrics.influx-rpc-service-bind-address=ip:rpc_port + ``` + + + **Note**: If more than one Kylin instance is deployed, apply the above settings in `kylin.properties` on each Kylin node and point them all to the same InfluxDB instance. + +6.
Encrypt the InfluxDB password + + If you need to encrypt the InfluxDB password, you can do it like this: + + **i.** Run the following command in ${KYLIN_HOME}. It prints the encrypted password. + ```shell + ./bin/kylin.sh org.apache.kylin.tool.general.CryptTool -e AES -s <password> + ``` + + **ii.** Configure `kylin.influxdb.password` like this: + ```properties + kylin.influxdb.password=ENC('${encrypted_password}') + ``` + + **iii.** Here is an example, assuming the InfluxDB password is `kylin`. + + First, encrypt `kylin` using the following command: + ```shell + ${KYLIN_HOME}/bin/kylin.sh org.apache.kylin.tool.general.CryptTool -e AES -s kylin + AES encrypted password is: + YeqVr9MakSFbgxEec9sBwg== + ``` + Then, configure `kylin.influxdb.password` like this: + ```properties + kylin.influxdb.password=ENC('YeqVr9MakSFbgxEec9sBwg==') + ``` + +7. Start Kylin. + +### <span id="not_root">Installation and Configuration for Non `root` User </span> + +The following steps use InfluxDB 1.6.4 as an example. + + +1. Suppose you install as user `abc`. Create a directory `/home/abc/influx`, copy the InfluxDB installation package `influxdb-1.6.4.x86_64.rpm` from `$KYLIN_HOME/influxdb` to this directory, and extract it there. + + ```sh + mkdir /home/abc/influx + cp $KYLIN_HOME/influxdb/influxdb-1.6.4.x86_64.rpm /home/abc/influx + cd /home/abc/influx + rpm2cpio influxdb-1.6.4.x86_64.rpm | cpio -idmv + ``` + +2. Edit the InfluxDB configuration file and replace `/var/lib` with `/home/abc/influx` globally. You can also modify the `bind-address` property according to your case. + + ```sh + vi /home/abc/influx/etc/influxdb/influxdb.conf + ``` + +3. Run the following command to launch InfluxDB. + + ```sh + nohup ./usr/bin/influxd run -config /home/abc/influx/etc/influxdb/influxdb.conf & + ``` + By default, you can find InfluxDB's log at `/home/abc/influx/var/log/influxdb`. + +4. For other configurations, please refer to the second part of this section, [`root` User Installation And Configuration](#root).
Note that to restart InfluxDB you need to stop the process with the following commands and then launch it again as in step 3. Using `service influxdb restart` will not work since it requires root permission. + + ```sh + ps -ef | grep influxdb + kill {pid} + ``` + +5. Launch Kylin. + + +### <span id="service">InfluxDB Connectivity</span> +To ensure the connectivity of the InfluxDB service, it is recommended that you perform some tests after starting InfluxDB. +- Log in to InfluxDB by entering the following command in the terminal: + ```sh + /home/abc/influx/usr/bin/influx -port 18086 -username ${username} -password ${pwd} + ``` + If the login fails, you can set `auth-enabled = true` in the configuration file `influxdb.conf` and try to log in again. +- After a successful login, you can execute some simple queries to check if InfluxDB is configured correctly: + ```sql + show databases; + ``` + If the query fails and the message `authorization failed` is displayed, please confirm whether the user has sufficient permissions. + +For more information about InfluxDB connectivity, please refer to the [InfluxDB Maintenance](../../Operation-and-Maintenance-Guide/influxdb/influxdb_maintenance.en.md) section. + + + +### <span id="https">(Optional) Configure HTTPS connection to InfluxDB </span> + +Before using HTTPS to connect to InfluxDB, you need to enable InfluxDB's HTTPS connection.
To enable HTTPS for InfluxDB, please refer to the official documentation: [Enabling HTTPS with InfluxDB](https://docs.influxdata.com/influxdb/v1.6/administration/https_setup/). + +If the InfluxDB you are using has HTTPS enabled, please set the following parameter in the `$KYLIN_HOME/conf/kylin.properties` configuration file: + +``` +kylin.influxdb.https.enabled=true +``` diff --git a/website/docs/operations/monitoring/influxdb/influxdb_maintenance.md b/website/docs/operations/monitoring/influxdb/influxdb_maintenance.md new file mode 100644 index 0000000000..632a6c675d --- /dev/null +++ b/website/docs/operations/monitoring/influxdb/influxdb_maintenance.md @@ -0,0 +1,124 @@ +--- +title: InfluxDB Maintenance +language: en +sidebar_label: InfluxDB Maintenance +pagination_label: InfluxDB Maintenance +toc_min_heading_level: 2 +toc_max_heading_level: 6 +pagination_prev: null +pagination_next: null +keywords: + - influxdb +draft: false +last_update: + date: 12/08/2022 +--- + + +## InfluxDB Maintenance + +This chapter introduces the basic maintenance of InfluxDB. + +### Connectivity + +When InfluxDB is not accessible, you can locate the problem from the following aspects: + +1. Check if InfluxDB is running normally by executing `service influxdb status`. If it is not running, check the log files `/var/log/influxdb/influxd.log` or `/var/log/messages` to find out the reason. Then run `service influxdb restart` to restart the InfluxDB service and make sure it launches normally by observing the logs. (You should be able to log in to InfluxDB via the `influx -host ? -port ?` command.) + +2. If you find that the port is already taken during startup, run `netstat -anp | grep influxdb_port` to get the process id, and execute `ps -ef | grep pid` to identify the specific process. You can either kill that process if you do not need it or change InfluxDB's server port to another one. + +3.
If Kylin and InfluxDB are installed on different nodes, please execute `telnet influxdb_ip influxdb_port` on the Kylin node to check whether the two nodes can communicate normally. If not, make sure the firewall service is not turned on for the InfluxDB node via the `service iptables status` command, or contact the system admin to check the network condition. + +### Log Management + +- **Log Configuration** + + - By default, InfluxDB writes its log to standard error. InfluxDB redirects stderr to the `/var/log/influxdb/influxd.log` file when it is started. If you would like to change the log path, please modify the property in the configuration file `/etc/default/influxdb` to `STDERR=/path/to/influxdb.log`, and restart the service via the `service influxdb restart` command. + - InfluxDB enables the HTTP access log by default. The HTTP access log is generally quite large, so you can set the property `[http] log-enabled=false` to disable the log output. + +- **Log Clean** + + InfluxDB itself does not clean its log regularly. It relies on **logrotate**, which is installed on Linux systems by default, to manage the log. The configuration file of **logrotate** is located at `/etc/logrotate.d/influxdb`. The log rotates daily, and the retention is 7 days. + +### Backup and Restore + +InfluxDB provides the ability to back up and restore data. + +- **Backup** + + ```sh + influxd backup -portable -database KE_METRIC -host 127.0.0.1:8089 /path/to/backup + ``` + +- **Restore** + + Please make sure that the database exists, otherwise the restore will fail. + + ```sh + influxd restore -portable -database KE_METRIC -host 127.0.0.1:8089 /path/to/backup + ``` + +> **Note:** Please replace KE_METRIC with the actual database name, replace 127.0.0.1:8089 with the actual IP and port, and replace `/path/to/backup` with the path you would like to use. + +### Monitoring and Diagnosis + +- **Memory Monitoring** + + - Check runtime + + Run the following command to check GC, memory usage, etc.
+ `influx -database KE_METRIC -execute "show stats for 'runtime'"` + + Please focus on these important fields: + - *HeapAlloc* -> Heap allocation size + - *Sys* -> The total number of bytes of memory obtained from the system + - *NumGC* -> The number of GC runs + - *PauseTotalNs* -> The total GC pause time + + - Check the memory usage of InfluxDB index + + `show stats for 'indexes'` + + - Monitor InfluxDB memory usage + + Run the following command: + + `pidstat -rh -p PID 5` + + If the memory usage is too high or GC is too frequent, please increase memory. + + > **Tips:** It is recommended to install InfluxDB on a separate machine with high memory allocation, because data read and write speeds depend on the indexes, and the indexes are stored in memory. + +- **Disk Monitoring** + + Run the following command to check disk usage: + + ```sh + pidstat -d -p PID 5 + ``` + + When the disk read/write load is found to be too high, you can consider mapping the WAL directory and the data directory to different disks to reduce the interaction between read and write operations. + + 1. Run `vi /etc/influxdb/influxdb.conf` to edit the configuration file. + 2. Modify the properties `[data] dir = "/var/lib/influxdb/data"` and `wal-dir = "/var/lib/influxdb/wal"` to point the WAL directory and the data directory to different disks. + +- **Read/Write Response Time** + + 1. Write: + + ```sql + SELECT non_negative_derivative(percentile("writeReqDurationNs", 99)) / non_negative_derivative(max("writeReq")) / (1000 * 1000) AS "Write Request" + FROM "_internal".."httpd" + WHERE time > now() - 10d + GROUP BY time(1h) fill(0) + ``` + + 2.
Read: + + ```sql + SELECT non_negative_derivative(percentile("queryReqDurationNs", 99)) / non_negative_derivative(max("queryReq")) / (1000 * 1000) AS "Query Request" + FROM "_internal".."httpd" + WHERE time > now() - 10d + GROUP BY time(1h) + ``` + diff --git a/website/docs/operations/monitoring/influxdb/intro.md b/website/docs/operations/monitoring/influxdb/intro.md new file mode 100644 index 0000000000..96ea606197 --- /dev/null +++ b/website/docs/operations/monitoring/influxdb/intro.md @@ -0,0 +1,22 @@ +--- +title: InfluxDB +language: en +sidebar_label: InfluxDB +pagination_label: InfluxDB +toc_min_heading_level: 2 +toc_max_heading_level: 6 +pagination_prev: null +pagination_next: null +keywords: + - monitoring + - maintenance +draft: false +last_update: + date: 12/08/2022 +--- + +Kylin supports using InfluxDB as its time series database. This chapter covers: + +* [Use InfluxDB as Time-Series Database](influxdb.md) +* [InfluxDB Maintenance](influxdb_maintenance.md) + diff --git a/website/docs/operations/monitoring/intro.md b/website/docs/operations/monitoring/intro.md new file mode 100644 index 0000000000..2d2d302d57 --- /dev/null +++ b/website/docs/operations/monitoring/intro.md @@ -0,0 +1,24 @@ +--- +title: Monitoring +language: en +sidebar_label: Monitoring +pagination_label: Monitoring +toc_min_heading_level: 2 +toc_max_heading_level: 6 +pagination_prev: null +pagination_next: null +keywords: + - monitoring + - maintenance +draft: false +last_update: + date: 12/08/2022 +--- + +This chapter discusses how to monitor the system. It covers: + +* [InfluxDB](influxdb/intro.md) + * [Use InfluxDB as Time-Series Database](influxdb/influxdb.md) + * [InfluxDB Maintenance](influxdb/influxdb_maintenance.md) +* [Metrics Monitoring](metrics_intro.md) +* [Service Monitoring](service.md) diff --git a/website/docs/operations/monitoring/metrics_intro.md b/website/docs/operations/monitoring/metrics_intro.md new file mode 100644 index 0000000000..cee36b3265 ---
/dev/null +++ b/website/docs/operations/monitoring/metrics_intro.md @@ -0,0 +1,209 @@ +--- +title: Metrics Monitoring +language: en +sidebar_label: Metrics Monitoring +pagination_label: Metrics Monitoring +toc_min_heading_level: 2 +toc_max_heading_level: 6 +pagination_prev: null +pagination_next: null +keywords: + - metrics monitoring +draft: true +last_update: + date: 12/08/2022 +--- + + +By default, the system collects metric data every minute, covering storage, query, job, metadata, and the cleanup mechanism. The monitoring data is stored in the specified [InfluxDB](https://www.influxdata.com/time-series-platform/) and displayed through [Grafana](https://grafana.com/grafana). It helps administrators understand the health of the system so that they can take necessary actions. + +> **Note**: Since Grafana depends on InfluxDB, please make sure that InfluxDB is correctly configured and started according to [Use InfluxDB as Time-Series Database](influxdb/influxdb.md) before you use Grafana. + +### <span id="grafana_startup">Grafana</span> + +1. **Working Directory**: ```$KYLIN_HOME/grafana``` +2. **Configuration Directory**: ```$KYLIN_HOME/grafana/conf``` +3. **Start Grafana Command**: ```$KYLIN_HOME/bin/grafana.sh start``` +4. **Stop Grafana Command**: ```$KYLIN_HOME/bin/grafana.sh stop``` + +> To change the Grafana configuration, please refer to [Configuration](https://grafana.com/docs/installation/configuration/). + +After a successful startup, you can access Grafana in a web browser on the default port 3000 with username `admin` and password `admin`. + +[comment]: <#TODO> () + +### <span id="dashboard">Dashboard</span> + +Default Dashboard: ```Kylin``` + +The dashboard consists of 9 modules: Cluster, Summaries, Models, Queries, Favorites, Jobs, Cleanings, Metadata Operations, and Transactions, among which the Summaries module is automatically displayed in detail. For more details about the modules, please refer to [Metrics Explanation](#explanation).
If you want to make changes to the dashboard, please refer to the Grafana official manual [Provisioning Grafana](https://grafana.com/docs/administration/provisioning/). + +### <span id="panel">Panel</span> + +Each monitored indicator corresponds to a specific panel. + +### <span id="__interval">Time Range</span> + +Choose the time range in the upper right corner of the dashboard. The time range is the time interval over which the indicator is observed. + + +### <span id="granularity">Data Granularity</span> + +The data granularity selector is located in the upper left corner of the dashboard. The options are: auto, 1m, 5m, 10m, 30m, 1h, 6h, 12h, 1d, 7d, 14d, 30d. With 'auto', the granularity is adjusted automatically according to the time range; for example, a time range of '30min' corresponds to a granularity of 5min, and a time range of 24h corresponds to a granularity of 4h. + +### <span id="explanation">Metrics Explanation</span> + +- [**Cluster**: Cluster overview](#cluster) +- [**Summaries**: Global overview](#summaries) +- [**Models**: Model related metrics](#models) +- [**Queries**: Query related metrics](#queries) +- [**Favorites**: Favorite Query related metrics](#favorites) +- [**Jobs**: Job related metrics](#jobs) +- [**Cleanings**: Cleanup mechanisms related metrics](#cleanings) +- [**Metadata Operations**: Metadata operations related metrics](#metadata) +- [**Transactions**: Transaction mechanisms related metrics](#transactions) + +> **Tip**: "Project related" in the following tables indicates whether the metric is related to the project: "Y" means it is, and "N" means it is not. "Host related" indicates whether the metric is related to Kylin nodes: "Y" means it is, and "N" means it is not. "all", "job", and "query" are the server modes of Kylin nodes.
+ +<span id="cluster">**Cluster**: Cluster overview</span> + +| Name | Meaning | Project related | +| :------------- | :---------- | :----------- | +| build_unavailable_duration | The unavailable time of building | N | +| query_unavailable_duration | The unavailable time of query | N | + +<span id="summaries">**Summaries**: Global overview</span> + +| Name | Meaning | Project related | Host related | Remark | +| :------------- | :---------- | :----------- | :----------- | :----------- | +| summary_exec_total_times | Times of all indicators collected | N | Y(all, job, query) | The cost of collecting indicators | +| summary_exec_total_duration | Duration of all indicators collected | N | Y(all, job, query) | The cost of collecting indicators | +| num_of_projects | Total project number | N | N | - | +| storage_size_gauge | Storage used of the system | Y | N | - | +| num_of_users | Total user number | N | N | - | +| num_of_hive_tables | Total data table number | Y | N | - | +| num_of_hive_databases | Total database number | Y | N | - | +| summary_of_heap | The heap size of Kylin | N | Y(all, job, query) | - | +| usage_of_heap | The ratio of heap of Kylin | N | Y(all, job, query) | - | +| count_of_garbage_collection | The count of garbage collection | N | Y(all, job, query) | - | +| time_of_garbage_collection | The total time of garbage collection | N | Y(all, job, query) | - | +| garbage_size_gauge | Storage used of garbage | Y | N | Refer to the definition of "Garbage" | +| sparder_restart_total_times | "Sparder" restart times | N | Y(all, job, query) | "Sparder" is the internal query engine | +| query_load | Spark SQL load | N | Y(all, query) | - | +| cpu_cores | The number of CPU cores for query configured in kylin.properties | N | Y(all, query) | Refer to "Spark-related Configuration" | + +<span id="models">**Models**: Model related metrics</span> + +| Name | Meaning | Project related | Host related | +| :------------- | :---------- | :----------- | :----------- | +|
model_num_gauge | "Model number" curve with time | Y | N | +| non_broken_model_num_gauge | "Healthy model number" curve with time | Y | N | +| last_query_time_of_models | The last query time of models | Y | N | +| hit_count_of_models | The query hit count of models | Y | N | +| storage_of_models | The storage of models | Y | N | +| segments_num_of_models | The number of segments of models | Y | N | +| model_build_duration | Total build time of models | Y | N | +| model_wait_duration | Total wait time of models | Y | N | +| number_of_indexes | The number of indexes of models | Y | N | +| expansion_rate_of_models | Expansion rate of models | Y | N | +| model_build_duration (avg) | Avg build time of models | Y | N | + +<span id="queries">**Queries**: Query related metrics</span> + +| Name | Meaning | Project related | Host related | Remark | +| :------------- | :---------- | :----------- | :----------- | :----------- | +| count_of_queries | Total count of queries | Y | Y(all, query) | - | +| num_of_query_per_host | The number of queries per host | N | Y(all, query) | - | +| count_of_queries_hitting_agg_index | The count of queries hitting agg index | Y | Y(all, query) | - | +| ratio_of_queries_hitting_agg_index | The ratio of queries hitting agg index | Y | Y(all, query) | - | +| count_of_queries_hitting_table_index | The count of queries hitting table index | Y | Y(all, query) | - | +| ratio_of_queries_hitting_table_index | The ratio of queries hitting table index | Y | Y(all, query) | - | +| count_of_pushdown_queries | The count of pushdown queries | Y | Y(all, query) | - | +| ratio_of_pushdown_queries | The ratio of pushdown queries | Y | Y(all, query) | - | +| count_of_queries_hitting_cache | The count of queries hitting cache | Y | Y(all, query) | - | +| ratio_of_queries_hitting_cache | The ratio of queries hitting cache | Y | Y(all, query) | - | +| count_of_queries_less_than_1s | Total count of queries when duration is less than 1 second | Y | Y(all, query) | - | +|
ratio_of_queries_less_than_1s | The ratio of queries when duration is less than 1 second | Y | Y(all, query) | - | +| count_of_queries_between_1s_and_3s | Total count of queries when duration is between 1 second and 3 seconds | Y | Y(all, query) | - | +| ration_of_queries_between_1s_and_3s | The ratio of queries when duration is between 1 second and 3 seconds | Y | Y(all, query) | - | +| count_of_queries_between_3s_and_5s | Total count of queries when duration is between 3 seconds and 5 seconds | Y | Y(all, query) | - | +| ratio_of_queries_between_3s_and_5s | The ratio of queries when duration is between 3 seconds and 5 seconds | Y | Y(all, query) | - | +| count_of_queries_between_5s_and_10s | Total count of queries when duration is between 5 seconds and 10 seconds | Y | Y(all, query) | - | +| ratio_of_queries_between_5s_and_10s | The ratio of queries when duration is between 5 seconds and 10 seconds | Y | Y(all, query) | - | +| count_of_queries_greater_than_10s | Total count of queries when duration exceeding 10 seconds | Y | Y(all, query) | - | +| ratio_of_queries_greater_than_10s | The ratio of queries when duration exceeding 10 seconds | Y | Y(all, query) | - | +| count_of_timeout_queries | The count of timeout queries | Y | Y(all, query) | - | +| count_of_failed_queries | The count of failed queries | Y | Y(all, query) | - | +| mean_time_of_query_per_host | The mean time of queries per host | N | Y(all, query) | - | +| 99%_of_query_latency | Query duration 99-percentile | Y | Y(all, query) | - | +| gt10s_query_rate_5-minute | Query duration exceeding 10s per second over 5 minutes | Y | Y(all, query) | - | +| failed_query_rate_5-minute | Failed queries per second over 5 minutes | Y | Y(all, query) | - | +| pushdown_query_rate_5-minute | Pushdown queries per second over 5 minutes | Y | Y(all, query) | - | +| scan_bytes_of_99%_queries | Query scan bytes 99-percentile | Y | Y(all, query) | - | +| query_scan_bytes_of_host | Query scan bytes per host | N | Y(all, 
query) |-| +| mean_scan_bytes_of_queries | The mean scan bytes of queries | Y | Y(all, query) | - | + +<span id="favorites">**Favorites**:Favorite Query related metrics</span> + +| Name | Meaning | Project related | Host related | Remark | +| :------------- | :---------- | :----------- | :----------- | :----------- | +| fq_accepted_total_times | Favorite Query user submitted total times | Y | Y(all, job, query) | - | +| fq_proposed_total_times | Favorite Query system triggered total times | Y | N | - | +| fq_proposed_total_duration | Favorite Query system triggered total duration | Y | N |-| +| failed_fq_proposed_total_times | Favorite Query system triggered failed total times | Y | N | Refer to the definition of "pushdown" | +| fq_adjusted_total_times | Favorite Query system adjusted total times | Y | Y(all, job, query) | - | +| fq_adjusted_total_duration | Favorite Query system adjusted total duration | Y | Y(all, job, query) | - | +| fq_update_usage_total_times | Favorite Query usage updated total times | Y | N | - | +| fq_update_usage_total_duration | Favorite Query usage updated total duration | Y | N | - | +| failed_fq_update_usage_total_times | Favorite Query usage updated failed total times | Y | N | - | +| fq_tobeaccelerated_num_gauge | Favorite Query to be accelerated | Y | N | - | +| fq_accelerated_num_gauge | Favorite Query accelerated | Y | N | - | +| fq_failed_num_gauge | Favorite Query accelerated failed times | Y | N | - | +| fq_accelerating_num_gauge | Favorite Query accelerating | Y | N | - | +| fq_pending_num_gauge | Favorite Query pending | Y | N | Favorite Query lacks of necessary conditions, such as missing column names, requiring user intervention | +| fq_blacklist_num_gauge | Favorite Query in blacklist | Y | N | Refer to the definition of "Blacklist" | + +<span id="jobs">**Jobs**:Job related metrics</span> + +| Name | Meaning | Project related | Host related | +| :------------- | :---------- | :----------- | :----------- | +| 
num_of_jobs_created | Jobs created total number | Y | Y(all, job) | +| num_of_jobs_finished | Jobs finished total number | Y | Y(all, job) | +| num_of_running_jobs | The num of running jobs currently | Y | N | +| num_of_pending_jobs | The num of pending jobs currently | Y | N | +| num_of_error_jobs | The num of error jobs currently | Y | N | +| count_of_error_jobs | The total count of error | Y | Y(all, job) | +| finished_jobs_total_duration | Jobs finished total duration | Y | Y(all, job) | +| job_duration_99p | Jobs duration 99-percentile | Y | Y(all, job) | +| job_step_attempted_total_times | Jobs step attempted total times | Y | Y(all, job) | +| failed_job_step_attempted_total_times | Jobs step attempted failed total times | Y | Y(all, job) | +| job_resumed_total_times | Jobs resumed total times | Y | Y(all, job) | +| job_discarded_total_times | Jobs discarded total times | Y | Y(all, job) | +| job_duration | The build duration of job | Y | Y(all, job) | +| job_wait_duration | The wait duration of job | Y | Y(all, job) | + +<span id="cleanings">**Cleanings**:Cleanup mechanisms related metrics</span> + +| Name | Meaning | Project related | Host related | +| :------------- | :---------- | :----------- | :----------- | +| storage_clean_total_times | Storage cleanup total times | N | Y(all, job, query) | +| storage_clean_total_duration | Storage cleanup total duration | N | Y(all, job, query) | +| failed_storage_clean_total_times | Storage cleanup failed total times | N | Y(all, job, query) | + +<span id="metadata">**Metadata Operations**:Metadata operations related metrics</span> + +| Name | Meaning | Project related | Host related | Remark | +| :------------- | :---------- | :----------- | :----------- | :----------- | +| metadata_clean_total_times | Metadata cleanup total times | Y | Y(all, job, query) | - | +| metadata_backup_total_times | Metadata backup total times | Y | Y(all, job, query) | Differentiate projects and global | +| 
metadata_backup_total_duration | Metadata backup total duration | Y | Y(all, job, query) | Differentiate projects and global | +| failed_metadata_backup_total_times | Metadata backup failed total times | Y | Y(all, job, query) | Differentiate projects and global | +| metadata_ops_total_times | Metadata daily operations total times | N | Y(all, job, query) | Fixed time per day (configurable): automatically back up metadata; rotate audit_log; clean up metadata and storage space; adjust FQ; clean up query histories. | +| metadata_success_ops_total_times | Metadata daily operations successful total times | N | Y(all, job, query) | - | + +<span id="transactions">**Transactions**: Transaction mechanisms related metrics</span> + +| Name | Meaning | Project related | Host related | Remark | +| :------------- | :---------- | :----------- | :----------- |:----------- | +| transaction_retry_total_times | Transactions retried total times | Y | Y(all, job, query) | Differentiate projects and global | +| transaction_latency_99p | Transactions duration 99-percentile | Y | Y(all, job, query) | Differentiate projects and global | diff --git a/website/docs/operations/monitoring/service.md b/website/docs/operations/monitoring/service.md new file mode 100644 index 0000000000..d1194457cf --- /dev/null +++ b/website/docs/operations/monitoring/service.md @@ -0,0 +1,181 @@ +--- +title: Service Monitoring +language: en +sidebar_label: Service Monitoring +pagination_label: Service Monitoring +toc_min_heading_level: 2 +toc_max_heading_level: 6 +pagination_prev: null +pagination_next: null +keywords: + - service monitoring +draft: true +last_update: + date: 12/08/2022 +--- + +## Service Monitoring + +Kylin provides service monitoring for its main components to help administrators obtain the service status and maintain instances. + +Currently, we provide the following methods to monitor the core components in Kylin: + +1. Query: each Query node records its service status in InfluxDB. +2.
Build: each All node will records the service status and job status in InfluxDB + +Two Rest APIs are provided to monitor and obtain the service status so that customers can integrate it with their own monitor platform. + +- Get the Kylin cluster status by monitor query and building services. If the status is `WARNING` or `CRASH`, it means the cluster is unstable. +- Get the service unavailable time with the specified time range and some detailed monitor data to help admins to track and retrospect. + +### How to Use + +**Get Cluster Status** + +`GET http://host:port/kylin/api/monitor/status` + +- HTTP Header + + - Accept: application/vnd.apache.kylin-v4-public+json + - Accept-Language: en + - Content-Type: application/json;charset=utf-8 + +- Curl Request Example + + ``` + curl -X GET \ + 'http://host:port/kylin/api/monitor/status' \ + -H 'Accept: application/vnd.apache.kylin-v4-public+json' \ + -H 'Accept-Language: en' \ + -H 'Authorization: Basic QURNSU46S1lMSU4=' \ + -H 'Content-Type: application/json;charset=utf-8' + ``` + +- Response Details + + - `active_instances` number of active instances in current cluster. + - `query_status` query service status. It could be GOOD / WARNING / CRASH + - `job_status` building service status. It could be GOOD / WARNING / CRASH. + - `Job` job instance status. It will show the instance details and status. + - `query` query instance status. It will show the instance details and status. 
+
+- Response Example
+
+  ```json
+  {
+      "code": "000",
+      "data": {
+          "active_instances": 1,
+          "query_status": "GOOD",
+          "job_status": "GOOD",
+          "job": [
+              {
+                  "instance": "sandbox.hortonworks.com:7070",
+                  "status": "GOOD"
+              }
+          ],
+          "query": [
+              {
+                  "instance": "sandbox.hortonworks.com:7070",
+                  "status": "GOOD"
+              }
+          ]
+      },
+      "msg": ""
+  }
+  ```
+
+
+
+**Get Cluster Status with Specific Time Range**
+
+`GET http://host:port/kylin/api/monitor/status/statistics`
+
+- HTTP Header
+
+  - Accept: application/vnd.apache.kylin-v4-public+json
+  - Accept-Language: en
+  - Content-Type: application/json;charset=utf-8
+
+- URL Parameters
+
+  - `start` - `required` `Long` timestamp. Returns monitoring data with a timestamp greater than or equal to this value.
+  - `end` - `required` `Long` timestamp. Returns monitoring data with a timestamp smaller than this value.
+
+- Curl Example
+
+  ```
+  curl -X GET \
+    'http://host:port/kylin/api/monitor/status/statistics?start=1583562358466&end=1583562358466' \
+    -H 'Accept: application/vnd.apache.kylin-v4-public+json' \
+    -H 'Accept-Language: en' \
+    -H 'Authorization: Basic QURNSU46S1lMSU4=' \
+    -H 'Content-Type: application/json;charset=utf-8'
+  ```
+
+- Response Details
+
+  - `start` start time of monitoring. It is rounded down based on the interval of the monitoring data. If the interval is 1 minute, data is only recorded at minute granularity. For example, if the argument is `1587353550000`, it is treated as `1587353520000`. Therefore, the data might be inaccurate.
+  - `end` end time of monitoring. It is rounded down based on the interval of the monitoring data. If the interval is 1 minute, data is only recorded at minute granularity. For example, if the argument is `1587353550000`, it is treated as `1587353520000`. Therefore, the data might be inaccurate.
+  - `interval` interval of the monitoring data. The default value is 60000 ms (1 min).
+  - `job` job instance status. It shows the instance details and status, including the unavailable time and count. The unavailable time is in ms.
+  - `query` query instance status. It shows the instance details and status, including the unavailable time and count. The unavailable time is in ms.
+
+- Response Example
+
+  ```
+  {
+      "code":"000",
+      "data":{
+          "start":1584151560000,
+          "end":1584151680000,
+          "interval":60000,
+          "job":[
+              {
+                  "instance":"sandbox.hortonworks.com:7070",
+                  "details":[
+                      {
+                          "time":1584151572650,
+                          "status":"GOOD"
+                      },
+                      {
+                          "time":1584151632770,
+                          "status":"GOOD"
+                      }
+                  ],
+                  "unavailable_time":0,
+                  "unavailable_count":0
+              }
+          ],
+          "query":[
+              {
+                  "instance":"sandbox.hortonworks.com:7070",
+                  "details":[
+                      {
+                          "time":1584151609142,
+                          "status":"GOOD"
+                      },
+                      {
+                          "time":1584151669142,
+                          "status":"GOOD"
+                      }
+                  ],
+                  "unavailable_time":0,
+                  "unavailable_count":0
+              }
+          ]
+      },
+      "msg":""
+  }
+  ```
+
+
+
+
+### Known Limitations
+
+1. The query used for detection is a constant query that does not scan HDFS files.
+2. InfluxDB is not highly available yet, so some monitoring data will be lost if the InfluxDB service is down.
+3. The job status may be inaccurate if a large number of jobs are deleted or discarded.
+4. System monitoring depends on InfluxDB. If system monitoring is enabled (it is by default) while InfluxDB is not configured, useless errors may appear in the log. In that case, it is recommended to set `kylin.monitor.enabled = false` in `kylin.properties` to turn off system monitoring.
+
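For readers who want to script the two monitoring endpoints described under "How to Use" instead of calling curl by hand, the following is a minimal Python sketch and not part of the patch. It reuses the headers from the curl examples, reproduces the documented round-down of `start` and `end` to the monitoring interval, and collects `unavailable_time` and `unavailable_count` per instance. The helper names (`basic_auth`, `round_down`, `statistics_url`, `fetch_json`, `summarize`) are hypothetical, and the rounding is assumed to be a plain floor to a multiple of `interval`, which matches the documented example.

```python
import base64
import json
import urllib.request

# Headers copied from the curl examples in this document.
BASE_HEADERS = {
    "Accept": "application/vnd.apache.kylin-v4-public+json",
    "Accept-Language": "en",
    "Content-Type": "application/json;charset=utf-8",
}


def basic_auth(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def round_down(timestamp_ms, interval_ms=60000):
    """Floor a millisecond timestamp to a multiple of the interval.

    Assumption: this mirrors how the service rounds `start` and `end`.
    """
    return timestamp_ms - timestamp_ms % interval_ms


def statistics_url(base_url, start_ms, end_ms):
    """Build the statistics URL; start is inclusive, end is exclusive."""
    return (f"{base_url}/kylin/api/monitor/status/statistics"
            f"?start={start_ms}&end={end_ms}")


def fetch_json(url, user, password):
    """Send a GET request with the documented headers and parse the JSON body."""
    headers = dict(BASE_HEADERS, Authorization=basic_auth(user, password))
    request = urllib.request.Request(url, headers=headers, method="GET")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(statistics):
    """Map (service, instance) to (unavailable_time_ms, unavailable_count)."""
    summary = {}
    for service in ("job", "query"):
        for entry in statistics["data"].get(service, []):
            key = (service, entry["instance"])
            summary[key] = (entry["unavailable_time"],
                            entry["unavailable_count"])
    return summary
```

A call such as `summarize(fetch_json(statistics_url("http://host:port", start, end), user, password))` would then return one entry per job and query instance. Choosing `start` and `end` at least one interval apart avoids an empty range, since both values are rounded down and `end` is exclusive.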