[kylin] 03/03: KYLIN-5221, add monitoring in operations

xxyu Fri, 12 Aug 2022 03:09:42 -0700

This is an automated email from the ASF dual-hosted git repository.

xxyu pushed a commit to branch doc5.0
in repository https://gitbox.apache.org/repos/asf/kylin.git


commit e840d1fb2af5a143f8b43514a8fa699227da43da
Author: Mukvin <boyboys...@163.com>
AuthorDate: Fri Aug 12 18:04:29 2022 +0800

    KYLIN-5221, add monitoring in operations
---
 .../operations/monitoring/images/dashboard.jpg     | Bin 0 -> 235619 bytes
 .../docs/operations/monitoring/images/interval.png | Bin 0 -> 162846 bytes
 .../operations/monitoring/influxdb/influxdb.md     | 211 +++++++++++++++++++++
 .../monitoring/influxdb/influxdb_maintenance.md    | 124 ++++++++++++
 .../docs/operations/monitoring/influxdb/intro.md   |  22 +++
 website/docs/operations/monitoring/intro.md        |  24 +++
 .../docs/operations/monitoring/metrics_intro.md    | 209 ++++++++++++++++++++
 website/docs/operations/monitoring/service.md      | 181 ++++++++++++++++++
 8 files changed, 771 insertions(+)

diff --git a/website/docs/operations/monitoring/images/dashboard.jpg b/website/docs/operations/monitoring/images/dashboard.jpg
new file mode 100644
index 0000000000..9d893aaa28
Binary files /dev/null and b/website/docs/operations/monitoring/images/dashboard.jpg differ
diff --git a/website/docs/operations/monitoring/images/interval.png b/website/docs/operations/monitoring/images/interval.png
new file mode 100644
index 0000000000..f89954398b
Binary files /dev/null and b/website/docs/operations/monitoring/images/interval.png differ
diff --git a/website/docs/operations/monitoring/influxdb/influxdb.md b/website/docs/operations/monitoring/influxdb/influxdb.md
new file mode 100644
index 0000000000..d0636f0f64
--- /dev/null
+++ b/website/docs/operations/monitoring/influxdb/influxdb.md
@@ -0,0 +1,211 @@
+---
+title: Use InfluxDB as Time-Series Database
+language: en
+sidebar_label: Use InfluxDB as Time-Series Database
+pagination_label: Use InfluxDB as Time-Series Database
+toc_min_heading_level: 2
+toc_max_heading_level: 6
+pagination_prev: null
+pagination_next: null
+keywords:
+    - influxdb
+draft: false
+last_update:
+    date: 12/08/2022
+---
+
+
+### <span id="preparation">Preparation</span>
+
+Starting with Kylin 5.0, the system stores query history in an RDBMS and uses InfluxDB only to record system monitoring information.
+
+If you need this information, configure the InfluxDB time-series database in advance so that the monitoring data can be stored.
+
+We recommend InfluxDB v1.6.4, which is provided in the Kylin installation package.
+
+The InfluxDB installation package, `influxdb-1.6.4.x86_64.rpm`, is located in the `influxdb` directory under the Kylin installation directory.
+
+If you want to use an existing InfluxDB instance in your environment, make sure it is one of the following versions:
+
+- InfluxDB 1.6.4 or above
+
+You can use the following command to check the version of InfluxDB in your current environment.
+
+```shell
+service influxdb version
+```
+
+### <span id="root">Installation and Configuration for `root` User</span>
+
+The following steps use InfluxDB 1.6.4 as an example.
+
+1. Run the following command to check whether InfluxDB is already installed.
+
+   ```shell
+   service influxdb status
+   ```
+
+   If not, go to the directory where the InfluxDB installation package is located and install InfluxDB.
+
+   ```shell
+   cd $KYLIN_HOME/influxdb
+   rpm -ivh influxdb-1.6.4.x86_64.rpm
+   ```
+
+2. Launch InfluxDB. 
+
+   ```sh
+   service influxdb start
+   ```
+
+   By default, you can find InfluxDB's log at `/var/log/influxdb`.
+
+3. If the InfluxDB server port is already in use, modify the InfluxDB configuration file to change it.
+
+   ```sh
+   vi /etc/influxdb/influxdb.conf
+   ```
+
+   Please note the following three points:
+
+   - Modify the RPC port: the initial property is `bind-address = "127.0.0.1:8088"`. You can change `8088` to an available port, for instance, `18087`.
+   - Modify the HTTP port: the initial property is `bind-address = ":8086"`. You can change `8086` to an available port, for instance, `18086`.
+   - Set `reporting-disabled = true`, which means that InfluxDB will not regularly send reports to [influxdata.com](https://www.influxdata.com/).
+
+4. InfluxDB is accessible without a user name and password by default. To strengthen security, you can set a password with the following steps:
+
+   1. Log in to InfluxDB.
+
+      ```sh
+      influx -port 18086
+      ```
+
+      **Tip:** Replace `18086` with the HTTP port that your InfluxDB instance actually uses.
+
+   2. Create an admin user and set its password.
+
+      ```sql
+      CREATE USER admin WITH PASSWORD 'admin' WITH ALL PRIVILEGES
+      ```
+
+   3. Open the configuration file and set `auth-enabled = true` in the `[http]` section to enable authentication.
+
+      ```sh
+      vi /etc/influxdb/influxdb.conf
+      ```
+
+   4. Restart InfluxDB for the change to take effect, then log in with the new credentials.
+
+      ```sh
+      service influxdb restart
+      influx -port 18086 -username admin -password admin
+      ```
+
+5. Open the property file `kylin.properties` and modify the InfluxDB configurations. Replace `ip:http_port`, `user`, `pwd`, and `ip:rpc_port` with real values.
+
+   ```properties
+   # Edit $KYLIN_HOME/conf/kylin.properties and modify the following properties
+
+   kylin.influxdb.address=ip:http_port
+   kylin.influxdb.username=user
+   kylin.influxdb.password=pwd
+   kylin.metrics.influx-rpc-service-bind-address=ip:rpc_port
+   ```
+
+   **Note**: If more than one Kylin instance is deployed, set the above properties in `kylin.properties` on every Kylin node and point them all to the same InfluxDB instance.
+
+6. Encrypt the InfluxDB password.
+
+   If you need to encrypt the InfluxDB password, proceed as follows:
+
+   **i.** Run the following command in `${KYLIN_HOME}`. It prints the encrypted password.
+
+   ```shell
+   ./bin/kylin.sh org.apache.kylin.tool.general.CryptTool -e AES -s <password>
+   ```
+
+   **ii.** Set `kylin.influxdb.password` as follows:
+
+   ```properties
+   kylin.influxdb.password=ENC('${encrypted_password}')
+   ```
+
+   **iii.** Here is an example, assuming the InfluxDB password is `kylin`.
+
+   First, encrypt `kylin` with the following command:
+
+   ```shell
+   ${KYLIN_HOME}/bin/kylin.sh org.apache.kylin.tool.general.CryptTool -e AES -s kylin
+   AES encrypted password is:
+   YeqVr9MakSFbgxEec9sBwg==
+   ```
+
+   Then, set `kylin.influxdb.password` as follows:
+
+   ```properties
+   kylin.influxdb.password=ENC('YeqVr9MakSFbgxEec9sBwg==')
+   ```
+
+7. Start Kylin.
+
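+The `influxdb.conf` changes from steps 3 and 4 can be summarized in one fragment. This is only a sketch: the ports `18087` and `18086` are the example values used above, and the rest of the file is assumed to keep its defaults.
+
+```toml
+# /etc/influxdb/influxdb.conf (fragment; example values only)
+
+# Do not send usage reports to influxdata.com
+reporting-disabled = true
+
+# RPC port, default "127.0.0.1:8088"
+bind-address = "127.0.0.1:18087"
+
+[http]
+  # HTTP port, default ":8086"
+  bind-address = ":18086"
+  # Require a user name and password (step 4)
+  auth-enabled = true
+```
+
+With these values, `kylin.influxdb.address` would use port `18086` and `kylin.metrics.influx-rpc-service-bind-address` would use port `18087`.
+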
+### <span id="not_root">Installation and Configuration for Non-`root` User</span>
+
+The following steps use InfluxDB 1.6.4 as an example.
+
+
+1. Suppose you install as user `abc`. Create the directory `/home/abc/influx`, copy the InfluxDB installation package `influxdb-1.6.4.x86_64.rpm` from `$KYLIN_HOME/influxdb` into it, and extract the package there.
+
+   ```sh
+   mkdir /home/abc/influx
+   cp $KYLIN_HOME/influxdb/influxdb-1.6.4.x86_64.rpm /home/abc/influx
+   cd /home/abc/influx
+   rpm2cpio influxdb-1.6.4.x86_64.rpm | cpio -idmv
+   ```
+
+2. Edit the InfluxDB configuration file and replace `/var/lib` with `/home/abc/influx` globally. You can also modify the `bind-address` properties to suit your environment.
+
+   ```sh
+   vi /home/abc/influx/etc/influxdb/influxdb.conf
+   ```
+
+3. Run the following command to launch InfluxDB.
+
+   ```sh
+   nohup ./usr/bin/influxd run -config /home/abc/influx/etc/influxdb/influxdb.conf &
+   ```
+
+   By default, you can find InfluxDB's log at `/home/abc/influx/var/log/influxdb`.
+
+4. For other configurations, refer to [Installation and Configuration for `root` User](#root). Note that `service influxdb restart` does not work here because it requires root permission. To restart InfluxDB, stop the process with the following commands and then launch it again as in step 3.
+
+   ```sh
+   ps -ef | grep influxdb
+   kill {pid}
+   ```
+
+5. Launch Kylin.
+
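+As a sketch of step 2, replacing `/var/lib` with `/home/abc/influx` would turn the default directory settings into something like the fragment below. The section and key names assume the stock InfluxDB 1.6 configuration file; check them against your own copy.
+
+```toml
+# /home/abc/influx/etc/influxdb/influxdb.conf (fragment)
+[meta]
+  # was /var/lib/influxdb/meta
+  dir = "/home/abc/influx/influxdb/meta"
+
+[data]
+  # was /var/lib/influxdb/data
+  dir = "/home/abc/influx/influxdb/data"
+  # was /var/lib/influxdb/wal
+  wal-dir = "/home/abc/influx/influxdb/wal"
+```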
+
+### <span id="service">InfluxDB Connectivity</span>
+
+To ensure the connectivity of the InfluxDB service, we recommend running some tests after starting InfluxDB.
+
+- Log in to InfluxDB from a terminal:
+  ```sh
+  /home/abc/influx/usr/bin/influx -port 18086 -username ${username} -password ${pwd}
+  ```
+  If the login fails, make sure that `auth-enabled = true` is set in the configuration file `influxdb.conf`, then try to log in again.
+- After a successful login, run a simple query to check whether InfluxDB is configured correctly:
+  ```sql
+  show databases;
+  ```
+  If the query fails with the message `authorization failed`, confirm that the user has sufficient permissions.
+
+For more information about InfluxDB connectivity, refer to the [InfluxDB Maintenance](influxdb_maintenance.md) section.
+
+
+
+### <span id="https">(Optional) Configure HTTPS Connection to InfluxDB</span>
+
+Before using HTTPS to connect to InfluxDB, you need to enable HTTPS on the InfluxDB side. To do so, refer to the official documentation: [Enabling HTTPS with InfluxDB](https://docs.influxdata.com/influxdb/v1.6/administration/https_setup/).
+
+If your InfluxDB instance has HTTPS enabled, set the following parameter in the `$KYLIN_HOME/conf/kylin.properties` configuration file:
+
+```
+kylin.influxdb.https.enabled=true
+```
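+
+On the InfluxDB side, enabling HTTPS comes down to a few settings in the `[http]` section of `influxdb.conf`. The fragment below is a sketch based on the InfluxDB 1.x option names; the certificate and key paths are placeholders, not values shipped with Kylin.
+
+```toml
+[http]
+  https-enabled = true
+  # Placeholder paths: point these at your own certificate and private key
+  https-certificate = "/etc/ssl/influxdb-selfsigned.crt"
+  https-private-key = "/etc/ssl/influxdb-selfsigned.key"
+```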
diff --git a/website/docs/operations/monitoring/influxdb/influxdb_maintenance.md b/website/docs/operations/monitoring/influxdb/influxdb_maintenance.md
new file mode 100644
index 0000000000..632a6c675d
--- /dev/null
+++ b/website/docs/operations/monitoring/influxdb/influxdb_maintenance.md
@@ -0,0 +1,124 @@
+---
+title: InfluxDB Maintenance
+language: en
+sidebar_label: InfluxDB Maintenance
+pagination_label: InfluxDB Maintenance
+toc_min_heading_level: 2
+toc_max_heading_level: 6
+pagination_prev: null
+pagination_next: null
+keywords:
+       - influxdb
+draft: false
+last_update:
+    date: 12/08/2022
+---
+
+
+## InfluxDB Maintenance
+
+This chapter introduces the basic maintenance of InfluxDB.
+
+### Connectivity
+
+When InfluxDB is not accessible, you can troubleshoot the problem from the following aspects:
+
+1. Check whether InfluxDB is running by executing `service influxdb status`. If it is not running, check the log file `/var/log/influxdb/influxd.log` or `/var/log/messages` to find the reason. Then run `service influxdb restart` and watch the logs to make sure the service starts normally. Afterwards, you should be able to log in to InfluxDB with `influx -host <influxdb_ip> -port <influxdb_port>`.
+
+2. If the port turns out to be occupied during startup, run `netstat -anp | grep influxdb_port` to get the ID of the process that holds it, and run `ps -ef | grep pid` to identify that process. You can kill the process if you do not need it, or change InfluxDB's server port to another one.
+
+3. If Kylin and InfluxDB are installed on different nodes, run `telnet influxdb_ip influxdb_port` on the Kylin node to check whether the two nodes can communicate. If they cannot, use `service iptables status` to make sure the firewall service is not turned on on the InfluxDB node, or contact the system administrator to check the network.
+
+### Log Management
+
+- **Log Configuration**
+
+  - By default, InfluxDB writes its log to standard error, and stderr is redirected to `/var/log/influxdb/influxd.log` when the service is started. To change the log path, set `STDERR=/path/to/influxdb.log` in the configuration file `/etc/default/influxdb` and restart the service with `service influxdb restart`.
+  - InfluxDB enables the HTTP access log by default. This log is usually quite large; to disable it, set `log-enabled = false` in the `[http]` section.
+
+- **Log Clean**
+
+  InfluxDB does not clean its log itself. It relies on **logrotate**, which is installed on Linux systems by default. The **logrotate** configuration file is located at `/etc/logrotate.d/influxdb`; the log rotates daily and is retained for 7 days.
+
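+For example, disabling the HTTP access log would look like this in `/etc/influxdb/influxdb.conf` (a minimal fragment; all other `[http]` settings are assumed unchanged):
+
+```toml
+[http]
+  # Turn off per-request HTTP access logging
+  log-enabled = false
+```
+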
+### Backup and Restore
+
+InfluxDB provides backup and restore capabilities.
+
+- **Backup**
+
+  ```sh
+  influxd backup -portable -database KE_METRIC -host 127.0.0.1:8089 /path/to/backup
+  ```
+
+- **Restore**
+
+  Make sure that the database exists; otherwise the restore will fail.
+
+  ```sh
+  influxd restore -portable -database KE_METRIC -host 127.0.0.1:8089 /path/to/backup
+  ```
+
+> **Note:** Replace `KE_METRIC` with the actual database name, `127.0.0.1:8089` with the actual IP and port, and `/path/to/backup` with the path you want to use.
+
+### Monitoring and Diagnosis
+
+- **Memory Monitoring**
+
+  - Check runtime
+
+    Run the following command to check GC, memory usage, etc.
+
+    `influx -database KE_METRIC -execute "show stats for 'runtime'"`
+
+    Pay attention to the following fields:
+    - *HeapAlloc* -> Heap allocation size
+    - *Sys* -> The total number of bytes of memory obtained from the system
+    - *NumGC* -> GC times
+    - *PauseTotalNs* -> The total GC pause time
+
+  - Check the memory usage of the InfluxDB index
+
+    `show stats for 'indexes'`
+
+  - Monitor InfluxDB memory usage
+
+    Run the following command:
+
+    `pidstat -rh -p PID 5`
+
+    If the memory usage is too high or GC is too frequent, increase the memory.
+
+    > **Tip:** We recommend installing InfluxDB on a separate machine with plenty of memory, because read and write speed depends on the indexes, and the indexes are stored in memory.
+       
+- **Disk Monitoring**
+
+  Run the following command to check disk I/O:
+
+  ```sh
+  pidstat -d -p PID 5
+  ```
+
+  When the disk read/write load is too high, consider placing the WAL directory and the data directory on different disks to reduce interference between read and write operations.
+
+  1. Run `vi /etc/influxdb/influxdb.conf` to edit the configuration file.
+  2. In the `[data]` section, modify `dir = "/var/lib/influxdb/data"` and `wal-dir = "/var/lib/influxdb/wal"` so that the data directory and the WAL directory point to different disks.
+
+- **Read/Write Response Time**
+
+  1. Write:
+
+     ```sql
+     SELECT non_negative_derivative(percentile("writeReqDurationNs", 99)) / non_negative_derivative(max("writeReq")) / (1000 * 1000) AS "Write Request"
+     FROM "_internal".."httpd"
+     WHERE time > now() - 10d
+     GROUP BY time(1h) fill(0)
+     ```
+
+  2. Read:
+
+     ```sql
+     SELECT non_negative_derivative(percentile("queryReqDurationNs", 99)) / non_negative_derivative(max("queryReq")) / (1000 * 1000) AS "Query Request"
+     FROM "_internal".."httpd"
+     WHERE time > now() - 10d
+     GROUP BY time(1h)
+     ```
+       
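+For the disk-monitoring advice in this chapter, a configuration that separates the two directories might look like the following fragment. The mount points `/disk1` and `/disk2` are hypothetical.
+
+```toml
+[data]
+  # Data files on one disk
+  dir = "/disk1/influxdb/data"
+  # Write-ahead log on another disk
+  wal-dir = "/disk2/influxdb/wal"
+```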
diff --git a/website/docs/operations/monitoring/influxdb/intro.md b/website/docs/operations/monitoring/influxdb/intro.md
new file mode 100644
index 0000000000..96ea606197
--- /dev/null
+++ b/website/docs/operations/monitoring/influxdb/intro.md
@@ -0,0 +1,22 @@
+---
+title: InfluxDB
+language: en
+sidebar_label: InfluxDB
+pagination_label: InfluxDB
+toc_min_heading_level: 2
+toc_max_heading_level: 6
+pagination_prev: null
+pagination_next: null
+keywords:
+    - monitoring
+    - maintenance
+draft: false
+last_update:
+    date: 12/08/2022
+---
+
+Kylin supports using InfluxDB as its time-series database. This chapter covers:
+
+* [Use InfluxDB as Time-Series Database](influxdb.md) 
+* [InfluxDB Maintenance](influxdb_maintenance.md)
+
diff --git a/website/docs/operations/monitoring/intro.md b/website/docs/operations/monitoring/intro.md
new file mode 100644
index 0000000000..2d2d302d57
--- /dev/null
+++ b/website/docs/operations/monitoring/intro.md
@@ -0,0 +1,24 @@
+---
+title: Monitoring
+language: en
+sidebar_label: Monitoring
+pagination_label: Monitoring
+toc_min_heading_level: 2
+toc_max_heading_level: 6
+pagination_prev: null
+pagination_next: null
+keywords:
+    - monitoring
+    - maintenance
+draft: false
+last_update:
+    date: 12/08/2022
+---
+
+This chapter discusses system monitoring and covers:
+
+* [InfluxDB](influxdb/intro.md)
+    * [Use InfluxDB as Time-Series Database](influxdb/influxdb.md)
+    * [InfluxDB Maintenance](influxdb/influxdb_maintenance.md)
+* [Metrics Monitoring](metrics_intro.md)
+* [Service Monitoring](service.md)
diff --git a/website/docs/operations/monitoring/metrics_intro.md b/website/docs/operations/monitoring/metrics_intro.md
new file mode 100644
index 0000000000..cee36b3265
--- /dev/null
+++ b/website/docs/operations/monitoring/metrics_intro.md
@@ -0,0 +1,209 @@
+---
+title: Metrics Monitoring
+language: en
+sidebar_label: Metrics Monitoring
+pagination_label: Metrics Monitoring
+toc_min_heading_level: 2
+toc_max_heading_level: 6
+pagination_prev: null
+pagination_next: null
+keywords:
+    - metrics monitoring
+draft: true
+last_update:
+    date: 12/08/2022
+---
+
+
+By default, the system collects metric data every minute, covering storage, queries, jobs, metadata, and the cleanup mechanism. The monitoring data is stored in the specified [InfluxDB](https://www.influxdata.com/time-series-platform/) and displayed through [Grafana](https://grafana.com/grafana). It helps administrators understand the health of the system and take necessary actions.
+
+> **Note**: Since Grafana depends on InfluxDB, make sure that InfluxDB is correctly configured and started according to [Use InfluxDB as Time-Series Database](influxdb/influxdb.md) before you use Grafana.
+
+### <span id="grafana_startup">Grafana</span>
+
+1. **Working Directory**: `$KYLIN_HOME/grafana`
+2. **Configuration Directory**: `$KYLIN_HOME/grafana/conf`
+3. **Start Grafana Command**: `$KYLIN_HOME/bin/grafana.sh start`
+4. **Stop Grafana Command**: `$KYLIN_HOME/bin/grafana.sh stop`
+
+> To change the Grafana configuration, refer to [Configuration](https://grafana.com/docs/installation/configuration/).
+
+After Grafana starts successfully, you can access it from a web browser with the default port `3000`, username `admin`, and password `admin`.
+
+[comment]: <#TODO> (![metrics_dashboard]&#40;images/dashboard.jpg&#41;)
+
+### <span id="dashboard">Dashboard</span>
+
+Default Dashboard: `Kylin`
+
+The dashboard consists of the following modules: Cluster, Summaries, Models, Queries, Favorites, Jobs, Cleanings, Metadata Operations, and Transactions. The Summaries module is displayed in detail by default. For details about each module, refer to [Metrics Explanation](#explanation). To customize the dashboard, refer to the official Grafana manual [Provisioning Grafana](https://grafana.com/docs/administration/provisioning/).
+
+### <span id="panel">Panel</span>
+
+Each monitored indicator corresponds to a specific panel.
+
+### <span id="__interval">Time Range</span>
+
+Choose the time range in the upper right corner of the dashboard. The time range is the interval over which the indicators are observed.
+
+![metrics_interval](images/interval.png)
+
+### <span id="granularity">Data Granularity</span>
+
+The data granularity selector is located in the upper left corner of the dashboard. The available values are auto, 1m, 5m, 10m, 30m, 1h, 6h, 12h, 1d, 7d, 14d, and 30d. With `auto`, the granularity is adjusted automatically according to the time range; for example, a time range of 30min corresponds to a granularity of 5min, and a time range of 24h corresponds to a granularity of 4h.
+
+### <span id="explanation">Metrics Explanation</span>
+
+- [**Cluster**: Cluster overview](#cluster)
+- [**Summaries**: Global overview](#summaries)
+- [**Models**: Model related metrics](#models)
+- [**Queries**: Query related metrics](#queries)
+- [**Favorites**: Favorite Query related metrics](#favorites)
+- [**Jobs**: Job related metrics](#jobs)
+- [**Cleanings**: Cleanup mechanisms related metrics](#cleanings)
+- [**Metadata Operations**: Metadata operations related metrics](#metadata)
+- [**Transactions**: Transaction mechanisms related metrics](#transactions)
+
+> **Tip**: "Project related" in the following tables indicates whether the metric is related to a project: "Y" means that it is, and "N" means that it is not. "Host related" indicates whether the metric is related to Kylin nodes: "Y" means that it is, and "N" means that it is not. "all", "job", and "query" are the server modes of Kylin nodes.
+
+<span id="cluster">**Cluster**: Cluster overview</span>
+
+| Name       | Meaning    | Project related     |
+| :------------- | :---------- | :----------- |
+| build_unavailable_duration | the unavailable time of building | N |
+| query_unavailable_duration | the unavailable time of query | N |
+
+<span id="summaries">**Summaries**: Global overview</span>
+
+| Name       | Meaning    | Project related | Host related     | Remark    |
+| :------------- | :---------- | :----------- | :----------- | :----------- |
+| summary_exec_total_times | Times of all indicators collected | N | Y(all, job, query) | The cost of collecting indicators |
+| summary_exec_total_duration | Duration of all indicators collected | N | Y(all, job, query) | The cost of collecting indicators |
+| num_of_projects | Total project number | N | N | - |
+| storage_size_gauge | Storage used by the system | Y | N | - |
+| num_of_users | Total user number | N | N | - |
+| num_of_hive_tables | Total data table number | Y | N | - |
+| num_of_hive_databases | Total database number | Y | N | - |
+| summary_of_heap | The heap size of Kylin | N | Y(all, job, query) | - |
+| usage_of_heap | The heap usage ratio of Kylin | N | Y(all, job, query) | - |
+| count_of_garbage_collection | The count of garbage collection | N | Y(all, job, query) | - |
+| time_of_garbage_collection | The total time of garbage collection | N | Y(all, job, query) | - |
+| garbage_size_gauge | Storage used by garbage | Y | N | Refer to the definition of "Garbage" |
+| sparder_restart_total_times | "Sparder" restart times | N | Y(all, job, query) | "Sparder" is the internal query engine |
+| query_load | Spark SQL load | N | Y(all, query) | - |
+| cpu_cores | The number of CPU cores for query configured in kylin.properties | N | Y(all, query) | Refer to "Spark-related Configuration" |
+
+<span id="models">**Models**: Model related metrics</span>
+
+| Name       | Meaning    | Project related | Host related     |
+| :------------- | :---------- | :----------- | :----------- |
+| model_num_gauge | "Model number" curve with time | Y | N |
+| non_broken_model_num_gauge | "Healthy model number" curve with time | Y | N |
+| last_query_time_of_models | The last query time of models | Y | N |
+| hit_count_of_models | The query hit count of models | Y | N |
+| storage_of_models | The storage of models | Y | N |
+| segments_num_of_models | The number of segments of models | Y | N |
+| model_build_duration | Total build time of models | Y | N |
+| model_wait_duration | Total wait time of models | Y | N |
+| number_of_indexes | The number of indexes of models | Y | N |
+| expansion_rate_of_models | Expansion rate of models | Y | N |
+| model_build_duration (avg) | Average build time of models | Y | N |
+
+<span id="queries">**Queries**: Query related metrics</span>
+
+| Name       | Meaning    | Project related | Host related    | Remark    |
+| :------------- | :---------- | :----------- | :----------- | :----------- |
+| count_of_queries | Total count of queries | Y | Y(all, query) | - |
+| num_of_query_per_host | The number of queries per host | N | Y(all, query) | - |
+| count_of_queries_hitting_agg_index | The count of queries hitting agg index | Y | Y(all, query) | - |
+| ratio_of_queries_hitting_agg_index | The ratio of queries hitting agg index | Y | Y(all, query) | - |
+| count_of_queries_hitting_table_index | The count of queries hitting table index | Y | Y(all, query) | - |
+| ratio_of_queries_hitting_table_index | The ratio of queries hitting table index | Y | Y(all, query) | - |
+| count_of_pushdown_queries | The count of pushdown queries | Y | Y(all, query) | - |
+| ratio_of_pushdown_queries | The ratio of pushdown queries | Y | Y(all, query) | - |
+| count_of_queries_hitting_cache | The count of queries hitting cache | Y | Y(all, query) | - |
+| ratio_of_queries_hitting_cache | The ratio of queries hitting cache | Y | Y(all, query) | - |
+| count_of_queries_less_than_1s | Total count of queries when duration is less than 1 second | Y | Y(all, query) | - |
+| ratio_of_queries_less_than_1s | The ratio of queries when duration is less than 1 second | Y | Y(all, query) | - |
+| count_of_queries_between_1s_and_3s | Total count of queries when duration is between 1 second and 3 seconds | Y | Y(all, query) | - |
+| ration_of_queries_between_1s_and_3s | The ratio of queries when duration is between 1 second and 3 seconds | Y | Y(all, query) | - |
+| count_of_queries_between_3s_and_5s | Total count of queries when duration is between 3 seconds and 5 seconds | Y | Y(all, query) | - |
+| ratio_of_queries_between_3s_and_5s | The ratio of queries when duration is between 3 seconds and 5 seconds | Y | Y(all, query) | - |
+| count_of_queries_between_5s_and_10s | Total count of queries when duration is between 5 seconds and 10 seconds | Y | Y(all, query) | - |
+| ratio_of_queries_between_5s_and_10s | The ratio of queries when duration is between 5 seconds and 10 seconds | Y | Y(all, query) | - |
+| count_of_queries_greater_than_10s | Total count of queries when duration exceeds 10 seconds | Y | Y(all, query) | - |
+| ratio_of_queries_greater_than_10s | The ratio of queries when duration exceeds 10 seconds | Y | Y(all, query) | - |
+| count_of_timeout_queries | The count of timeout queries | Y | Y(all, query) | - |
+| count_of_failed_queries | The count of failed queries | Y | Y(all, query) | - |
+| mean_time_of_query_per_host | The mean time of queries per host | N | Y(all, query) | - |
+| 99%_of_query_latency | Query duration 99-percentile | Y | Y(all, query) | - |
+| gt10s_query_rate_5-minute | Queries exceeding 10s per second over 5 minutes | Y | Y(all, query) | - |
+| failed_query_rate_5-minute | Failed queries per second over 5 minutes | Y | Y(all, query) | - |
+| pushdown_query_rate_5-minute | Pushdown queries per second over 5 minutes | Y | Y(all, query) | - |
+| scan_bytes_of_99%_queries | Query scan bytes 99-percentile | Y | Y(all, query) | - |
+| query_scan_bytes_of_host | Query scan bytes per host | N | Y(all, query) | - |
+| mean_scan_bytes_of_queries | The mean scan bytes of queries | Y | Y(all, query) | - |
+
+<span id="favorites">**Favorites**: Favorite Query related metrics</span>
+
+| Name       | Meaning    | Project related | Host related    | Remark    |
+| :------------- | :---------- | :----------- | :----------- | :----------- |
+| fq_accepted_total_times | Favorite Query user submitted total times | Y | Y(all, job, query) | - |
+| fq_proposed_total_times | Favorite Query system triggered total times | Y | N | - |
+| fq_proposed_total_duration | Favorite Query system triggered total duration | Y | N | - |
+| failed_fq_proposed_total_times | Favorite Query system triggered failed total times | Y | N | Refer to the definition of "pushdown" |
+| fq_adjusted_total_times | Favorite Query system adjusted total times | Y | Y(all, job, query) | - |
+| fq_adjusted_total_duration | Favorite Query system adjusted total duration | Y | Y(all, job, query) | - |
+| fq_update_usage_total_times | Favorite Query usage updated total times | Y | N | - |
+| fq_update_usage_total_duration | Favorite Query usage updated total duration | Y | N | - |
+| failed_fq_update_usage_total_times | Favorite Query usage updated failed total times | Y | N | - |
+| fq_tobeaccelerated_num_gauge | Favorite Query to be accelerated | Y | N | - |
+| fq_accelerated_num_gauge | Favorite Query accelerated | Y | N | - |
+| fq_failed_num_gauge | Favorite Query accelerated failed times | Y | N | - |
+| fq_accelerating_num_gauge | Favorite Query accelerating | Y | N | - |
+| fq_pending_num_gauge | Favorite Query pending | Y | N | Favorite Query lacks necessary conditions, such as missing column names, and requires user intervention |
+| fq_blacklist_num_gauge | Favorite Query in blacklist | Y | N | Refer to the definition of "Blacklist" |
+
+<span id="jobs">**Jobs**: Job related metrics</span>
+
+| Name       | Meaning    | Project related | Host related     |
+| :------------- | :---------- | :----------- | :----------- |
+| num_of_jobs_created | Jobs created total number | Y | Y(all, job) |
+| num_of_jobs_finished | Jobs finished total number | Y | Y(all, job) |
+| num_of_running_jobs | The number of currently running jobs | Y | N |
+| num_of_pending_jobs | The number of currently pending jobs | Y | N |
+| num_of_error_jobs | The number of jobs currently in error | Y | N |
+| count_of_error_jobs | The total count of error jobs | Y | Y(all, job) |
+| finished_jobs_total_duration | Jobs finished total duration | Y | Y(all, job) |
+| job_duration_99p | Jobs duration 99-percentile | Y | Y(all, job) |
+| job_step_attempted_total_times | Jobs step attempted total times | Y | Y(all, job) |
+| failed_job_step_attempted_total_times | Jobs step attempted failed total times | Y | Y(all, job) |
+| job_resumed_total_times | Jobs resumed total times | Y | Y(all, job) |
+| job_discarded_total_times | Jobs discarded total times | Y | Y(all, job) |
+| job_duration | The build duration of job | Y | Y(all, job) |
+| job_wait_duration | The wait duration of job | Y | Y(all, job) |
+
+<span id="cleanings">**Cleanings**: Cleanup mechanisms related metrics</span>
+
+| Name       | Meaning    | Project related | Host related     |
+| :------------- | :---------- | :----------- | :----------- |
+| storage_clean_total_times | Storage cleanup total times | N | Y(all, job, query) |
+| storage_clean_total_duration | Storage cleanup total duration | N | Y(all, job, query) |
+| failed_storage_clean_total_times | Storage cleanup failed total times | N | Y(all, job, query) |
+
+<span id="metadata">**Metadata Operations**：Metadata operations related 
metrics</span>
+
+| Name       | Meaning    | Project related | Host related     | Remark    |
+| :------------- | :---------- | :----------- | :----------- | :----------- |
+|  metadata_clean_total_times | Metadata cleanup total times | Y | Y(all, job, query) | - |
+|  metadata_backup_total_times | Metadata backup total times | Y | Y(all, job, query) | Distinguishes between project-level and global |
+|  metadata_backup_total_duration | Metadata backup total duration | Y | Y(all, job, query) | Distinguishes between project-level and global |
+|  failed_metadata_backup_total_times | Metadata backup failed total times | Y | Y(all, job, query) | Distinguishes between project-level and global |
+|  metadata_ops_total_times | Metadata daily operations total times | N | Y(all, job, query) | Runs at a fixed, configurable time each day: automatically back up metadata; rotate audit_log; clean up metadata and storage space; adjust FQ; clean up query histories. |
+|  metadata_success_ops_total_times | Metadata daily operations successful total times | N | Y(all, job, query) | - |
+
+<span id="transactions">**Transactions**: Metrics related to transaction mechanisms</span>
+
+| Name       | Meaning    | Project related | Host related     | Remark    |
+| :------------- | :---------- | :----------- | :----------- |:----------- |
+|  transaction_retry_total_times | Transactions retried total times | Y | Y(all, job, query) | Distinguishes between project-level and global |
+|  transaction_latency_99p | Transactions duration 99-percentile | Y | Y(all, job, query) | Distinguishes between project-level and global |
diff --git a/website/docs/operations/monitoring/service.md 
b/website/docs/operations/monitoring/service.md
new file mode 100644
index 0000000000..d1194457cf
--- /dev/null
+++ b/website/docs/operations/monitoring/service.md
@@ -0,0 +1,181 @@
+---
+title: Service Monitoring
+language: en
+sidebar_label: Service Monitoring
+pagination_label: Service Monitoring
+toc_min_heading_level: 2
+toc_max_heading_level: 6
+pagination_prev: null
+pagination_next: null
+keywords:
+  - service monitoring
+draft: true
+last_update:
+  date: 12/08/2022
+---
+
+## Service Monitoring
+
+Kylin provides service monitoring for its main components to help administrators check the service status and maintain instances.
+
+Currently, the core components in Kylin are monitored as follows:
+
+1. Query: each Query node records its service status in InfluxDB.
+2. Build: each All node records its service status and job status in InfluxDB.
+
+Two REST APIs expose the service status so that users can integrate it with their own monitoring platform.
+
+- Get the Kylin cluster status by monitoring the query and build services. A status of `WARNING` or `CRASH` means the cluster is unstable.
+- Get the service unavailable time within a specified time range, along with detailed monitoring data, to help administrators track down and review problems.
+
+### How to Use
+
+**Get Cluster Status**
+
+`GET http://host:port/kylin/api/monitor/status`
+
+- HTTP Header
+
+  - Accept: application/vnd.apache.kylin-v4-public+json
+  - Accept-Language: en
+  - Content-Type: application/json;charset=utf-8
+
+- Curl Request Example
+
+  ```
+  curl -X GET \
+  'http://host:port/kylin/api/monitor/status' \
+  -H 'Accept: application/vnd.apache.kylin-v4-public+json' \
+  -H 'Accept-Language: en' \
+  -H 'Authorization: Basic QURNSU46S1lMSU4=' \
+  -H 'Content-Type: application/json;charset=utf-8' 
+  ```
+
+- Response Details
+
+  - `active_instances` number of active instances in the current cluster.
+  - `query_status` query service status. It can be `GOOD`, `WARNING`, or `CRASH`.
+  - `job_status` build service status. It can be `GOOD`, `WARNING`, or `CRASH`.
+  - `job` job instance status. It shows the instance details and status.
+  - `query` query instance status. It shows the instance details and status.
+
+- Response Example
+
+  ```json
+  {
+      "code": "000",
+      "data": {
+          "active_instances": 1,
+          "query_status": "GOOD",
+          "job_status": "GOOD",
+          "job": [
+              {
+                  "instance": "sandbox.hortonworks.com:7070",
+                  "status": "GOOD"
+              }
+          ],
+          "query": [
+              {
+                  "instance": "sandbox.hortonworks.com:7070",
+                  "status": "GOOD"
+              }
+          ]
+      },
+      "msg": ""
+  }
+  ```
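A minimal Python sketch of how a monitoring integration might interpret the response above. The function name `cluster_is_stable` is hypothetical and not part of Kylin; the sketch assumes only the field names shown in the response example and treats `WARNING` or `CRASH` as unstable, as described earlier.

```python
import json

# Statuses that the text above describes as unstable.
UNSTABLE = {"WARNING", "CRASH"}


def cluster_is_stable(response_text):
    """Return (stable, problems) for a /kylin/api/monitor/status response body."""
    data = json.loads(response_text)["data"]
    stable = (data["query_status"] not in UNSTABLE
              and data["job_status"] not in UNSTABLE)
    # Collect every job/query instance that reports an unstable status.
    problems = [
        (kind, item["instance"], item["status"])
        for kind in ("job", "query")
        for item in data.get(kind, [])
        if item["status"] in UNSTABLE
    ]
    return stable, problems


# Compact copy of the response example above.
SAMPLE = """
{"code": "000",
 "data": {"active_instances": 1, "query_status": "GOOD", "job_status": "GOOD",
          "job": [{"instance": "sandbox.hortonworks.com:7070", "status": "GOOD"}],
          "query": [{"instance": "sandbox.hortonworks.com:7070", "status": "GOOD"}]},
 "msg": ""}
"""

print(cluster_is_stable(SAMPLE))  # (True, [])
```

Returning the offending instances alongside the boolean lets an alerting rule name the node that needs attention rather than only flagging the cluster.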
+
+  
+
+**Get Cluster Status with Specific Time Range**
+
+`GET http://host:port/kylin/api/monitor/status/statistics`
+
+- HTTP Header
+
+  - Accept: application/vnd.apache.kylin-v4-public+json
+  - Accept-Language: en
+  - Content-Type: application/json;charset=utf-8
+
+- URL Parameters
+
+  - `start` - `required` `Long` timestamp. Returns the monitoring data with a timestamp greater than or equal to this value.
+  - `end` - `required` `Long` timestamp. Returns the monitoring data with a timestamp smaller than this value.
+
+- Curl Example
+
+  ```
+  curl -X GET \
+    'http://host:port/kylin/api/monitor/status/statistics?start=1583562358466&end=1583562358466' \
+    -H 'Accept: application/vnd.apache.kylin-v4-public+json' \
+    -H 'Accept-Language: en' \
+    -H 'Authorization: Basic QURNSU46S1lMSU4=' \
+    -H 'Content-Type: application/json;charset=utf-8'
+  ```
+
+- Response Details
+
+  - `start` start time of monitoring. It is rounded down based on the interval of the monitoring data. With a 1-minute interval, data is recorded at minute granularity; for example, the argument `1587353550000` is treated as `1587353520000`. Therefore, the returned data may not correspond exactly to the requested time.
+  - `end` end time of monitoring. It is rounded down in the same way as `start`; for example, with a 1-minute interval the argument `1587353550000` is treated as `1587353520000`.
+  - `interval` interval of the monitoring data. The default value is 60000 ms (1 min).
+  - `job` job instance status. It shows the instance details and status, including the unavailable time (in ms) and the unavailable count.
+  - `query` query instance status. It shows the instance details and status, including the unavailable time (in ms) and the unavailable count.
+
+- Response Example
+
+  ```json
+  {
+      "code":"000",
+      "data":{
+          "start":1584151560000,
+          "end":1584151680000,
+          "interval":60000,
+          "job":[
+              {
+                  "instance":"sandbox.hortonworks.com:7070",
+                  "details":[
+                      {
+                          "time":1584151572650,
+                          "status":"GOOD"
+                      },
+                      {
+                          "time":1584151632770,
+                          "status":"GOOD"
+                      }
+                  ],
+                  "unavailable_time":0,
+                  "unavailable_count":0
+              }
+          ],
+          "query":[
+              {
+                  "instance":"sandbox.hortonworks.com:7070",
+                  "details":[
+                      {
+                          "time":1584151609142,
+                          "status":"GOOD"
+                      },
+                      {
+                          "time":1584151669142,
+                          "status":"GOOD"
+                      }
+                  ],
+                  "unavailable_time":0,
+                  "unavailable_count":0
+              }
+          ]
+      },
+      "msg":""
+  }
+  ```
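The rounding of `start` and `end` and the request itself can be sketched in Python as follows. This is an illustration under stated assumptions, not Kylin code: `align_down` and `statistics_request` are hypothetical helper names, `http://host:7070` is a placeholder address, and the rounding only mirrors locally what the documentation says the server does. The `ADMIN` / `KYLIN` credentials are what the `Authorization` header in the curl example decodes to.

```python
import base64
from urllib.parse import urlencode


def align_down(ts_ms, interval_ms=60000):
    """Round a millisecond timestamp down to a multiple of the interval."""
    return ts_ms - ts_ms % interval_ms


def statistics_request(base_url, start_ms, end_ms, user, password):
    """Build the URL and headers for GET /kylin/api/monitor/status/statistics."""
    query = urlencode({"start": start_ms, "end": end_ms})
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Accept": "application/vnd.apache.kylin-v4-public+json",
        "Accept-Language": "en",
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json;charset=utf-8",
    }
    return f"{base_url}/kylin/api/monitor/status/statistics?{query}", headers


print(align_down(1587353550000))  # 1587353520000, as in the description above

url, headers = statistics_request(
    "http://host:7070", 1584151560000, 1584151680000, "ADMIN", "KYLIN")
print(url)
print(headers["Authorization"])  # Basic QURNSU46S1lMSU4=
```

Because `start` is inclusive and `end` is exclusive, a caller would normally pass an `end` at least one interval later than `start` to get any data back.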
+
+
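For integration with an external monitoring platform, the statistics response could be reduced to one record per instance, as in the sketch below. The helper name `summarize_unavailability` and the derived `unavailable_ratio` are assumptions for illustration; the API itself returns only `unavailable_time` and `unavailable_count`. The sample payload follows the structure of the response example, but the downtime values for the query instance are invented to make the output non-trivial.

```python
import json


def summarize_unavailability(response_text):
    """Map (kind, instance) to downtime figures from a statistics response body."""
    data = json.loads(response_text)["data"]
    window_ms = data["end"] - data["start"]
    summary = {}
    for kind in ("job", "query"):
        for item in data.get(kind, []):
            down_ms = item["unavailable_time"]  # milliseconds, per the docs above
            summary[(kind, item["instance"])] = {
                "unavailable_ms": down_ms,
                "unavailable_count": item["unavailable_count"],
                # Derived value, not returned by the API.
                "unavailable_ratio": down_ms / window_ms if window_ms > 0 else 0.0,
            }
    return summary


# Same shape as the response example; the query downtime is made up.
STATS_SAMPLE = """
{"code": "000",
 "data": {"start": 1584151560000, "end": 1584151680000, "interval": 60000,
          "job": [{"instance": "sandbox.hortonworks.com:7070", "details": [],
                   "unavailable_time": 0, "unavailable_count": 0}],
          "query": [{"instance": "sandbox.hortonworks.com:7070", "details": [],
                     "unavailable_time": 60000, "unavailable_count": 1}]},
 "msg": ""}
"""

for key, value in summarize_unavailability(STATS_SAMPLE).items():
    print(key, value)
```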
+
+
+### Known Limitations
+
+1. The query used for detection is a constant query, which does not scan HDFS files.
+2. InfluxDB is not currently highly available, so some monitoring data will be lost if the InfluxDB service is down.
+3. The job status will be inaccurate if a large number of jobs are deleted or discarded.
+4. System monitoring depends on InfluxDB. If system monitoring stays enabled (it is enabled by default) while InfluxDB is not configured, spurious errors may appear in the log. In that case, it is recommended to set `kylin.monitor.enabled = false` in `kylin.properties` to turn off system monitoring.
+
