Commit 7405bfe7 · Author: Wang JiaJu

Add explanations of infra metric definitions

Parent e5dfda8f
+78 −36
@@ -163,42 +163,84 @@ APM metrics mainly reflect the request and response behavior of business services within a given time window

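Performance metrics reflect the system's underlying resource usage across several object types, including Pods, Nodes, and TiDB components. The table below lists common metric codes and their meanings (primarily the kpi_key values in the Parquet files):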

| Object type | Metric code                         | Metric name                                            |
| ----------- | ----------------------------------- | ------------------------------------------------------ |
| pod         | pod_cpu_usage                       | CPU utilization                                        |
| pod         | pod_processes                       | Process count                                          |
| pod         | pod_memory_working_set_bytes        | Memory in use                                          |
| pod         | pod_fs_writes_bytes                 | Cumulative count of bytes written                      |
| pod         | pod_fs_reads_bytes                  | Cumulative count of bytes read                         |
| pod         | pod_network_receive_bytes           | Cumulative count of bytes received                     |
| pod         | pod_network_transmit_bytes          | Cumulative count of bytes transmitted                  |
| pod         | pod_network_receive_packets         | Cumulative count of packets received                   |
| pod         | pod_network_transmit_packets        | Cumulative count of packets transmitted                |
| node        | node_cpu_usage_rate                 | CPU utilization                                        |
| node        | node_memory_usage_rate              | Memory utilization                                     |
| node        | node_filesystem_usage_rate          | Disk utilization                                       |
| node        | node_memory_MemAvailable_bytes      | Available memory                                       |
| node        | node_memory_MemTotal_bytes          | Total memory                                           |
| node        | node_filesystem_size_bytes          | Total storage device size                              |
| node        | node_filesystem_free_bytes          | Free storage device space                              |
| node        | node_disk_read_bytes_total          | Bytes read successfully                                |
| node        | node_disk_read_time_seconds_total   | Seconds spent reading, per disk partition              |
| node        | node_disk_written_bytes_total       | Bytes written successfully                             |
| node        | node_disk_write_time_seconds_total  | Seconds spent writing, per disk partition              |
| node        | node_network_receive_bytes_total    | {{device}} interface receive rate                      |
| node        | node_network_receive_packets_total  | {{device}} interface packets received per second       |
| node        | node_network_transmit_bytes_total   | {{device}} interface transmit rate                     |
| node        | node_network_transmit_packets_total | {{device}} interface packets transmitted per second    |
| node        | node_sockstat_TCP_inuse             | TCP_inuse: number of TCP sockets in use (listening)    |
| tidb        | connection_count                    | Connection count                                       |
| tidb        | failed_query_ops                    | Failed request count                                   |
| tidb        | duration_99th                       | 99th-percentile request latency                        |
| tidb        | duration_95th                       | 95th-percentile request latency                        |
| tidb        | duration_avg                        | Average request latency                                |
| tidb        | qps                                 | Request count                                          |
| tidb        | slow_query                          | Slow queries                                           |
| tidb        | block_cache_size                    | Block Cache size                                       |

| object_type | kpi_key                          | kpi_name               | promql                                                                                                                      |
|-------------|----------------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| pod         | pod_cpu_usage                    | CPU utilization                         | `sum(irate(container_cpu_usage_seconds_total{container=~"server\|redis"}[1m])) by (namespace,instance,pod)` |
| pod         | pod_processes                    | Process count                           | `sum(container_processes{container=~"server\|redis"}) by (namespace,instance,pod)` |
| pod         | pod_memory_working_set_bytes     | Memory in use                           | `sum(rate(container_memory_working_set_bytes{container=~"server\|redis"}[1m])) by (namespace,instance,pod)` |
| pod         | pod_fs_writes_bytes              | Cumulative count of bytes written       | `sum(irate(container_fs_writes_bytes_total{container=~"server\|redis"}[1m])) by (namespace,instance,pod,device)` |
| pod         | pod_fs_reads_bytes               | Cumulative count of bytes read          | `sum(irate(container_fs_reads_bytes_total{container=~"server\|redis"}[1m])) by (namespace,instance,pod,device)` |
| pod         | pod_network_receive_bytes        | Cumulative count of bytes received      | `sum(irate(container_network_receive_bytes_total{namespace="hipstershop",interface="eth0"}[1m])) by (namespace,instance,pod)` |
| pod         | pod_network_transmit_bytes       | Cumulative count of bytes transmitted   | `sum(irate(container_network_transmit_bytes_total{namespace="hipstershop",interface="eth0"}[1m])) by (namespace,instance,pod)` |
| pod         | pod_network_receive_packets      | Cumulative count of packets received    | `sum(irate(container_network_receive_packets_total{namespace="hipstershop",interface="eth0"}[1m])) by (namespace,instance,pod)` |
| pod         | pod_network_transmit_packets     | Cumulative count of packets transmitted | `sum(irate(container_network_transmit_packets_total{namespace="hipstershop",interface="eth0"}[1m])) by (namespace,instance,pod)` |
| node        | node_cpu_usage_rate              | CPU utilization    | `100 - (avg(irate(node_cpu_seconds_total{job="kubernetes-nodes",mode="idle"}[1m])) by (kubernetes_node,instance) * 100)` |
| node        | node_memory_usage_rate           | Memory utilization | `100 - (node_memory_MemFree_bytes{job="kubernetes-nodes"} + node_memory_Cached_bytes{job="kubernetes-nodes"} + node_memory_Buffers_bytes{job="kubernetes-nodes"}) / node_memory_MemTotal_bytes{job="kubernetes-nodes"} * 100` |
| node        | node_filesystem_usage_rate       | Disk utilization   | `100 - (node_filesystem_free_bytes{job="kubernetes-nodes",fstype=~"ext4\|xfs"} / node_filesystem_size_bytes{job="kubernetes-nodes",fstype=~"ext4\|xfs"} * 100)` |
| node        | node_memory_MemAvailable_bytes   | Available memory   | `node_memory_MemAvailable_bytes{job="kubernetes-nodes"}` |
| node        | node_memory_MemTotal_bytes       | Total memory       | `node_memory_MemTotal_bytes{job="kubernetes-nodes"}` |
| node        | node_filesystem_size_bytes       | Total disk size    | `node_filesystem_size_bytes{job="kubernetes-nodes",mountpoint=~"/rootfs\|/"}` |
| node        | node_filesystem_free_bytes       | Free disk space    | `node_filesystem_free_bytes{job="kubernetes-nodes",mountpoint=~"/rootfs\|/"}` |
| node        | node_disk_read_bytes_total             | Bytes read successfully                                           | `sum(irate(node_disk_read_bytes_total{job="kubernetes-nodes",device="vda"}[1m])) by (kubernetes_node,instance,device)` |
| node        | node_disk_read_time_seconds_total      | Read time: seconds spent reading, per disk partition              | `sum(irate(node_disk_read_time_seconds_total{job="kubernetes-nodes",device="vda"}[1m])) by (kubernetes_node,instance,device)` |
| node        | node_disk_written_bytes_total          | Bytes written successfully                                        | `sum(irate(node_disk_written_bytes_total{job="kubernetes-nodes",device="vda"}[1m])) by (kubernetes_node,instance,device)` |
| node        | node_disk_write_time_seconds_total     | Write time: seconds spent writing, per disk partition             | `sum(irate(node_disk_write_time_seconds_total{job="kubernetes-nodes",device="vda"}[1m])) by (kubernetes_node,instance,device)` |
| node        | node_network_receive_bytes_total       | `{{device}}` - Receive: receive rate of each network interface    | `sum(irate(node_network_receive_bytes_total{job="kubernetes-nodes"}[1m])) by (kubernetes_node,instance)` |
| node        | node_network_receive_packets_total     | `{{device}}` - Receive: packets received per second, per interface | `sum(irate(node_network_receive_packets_total{job="kubernetes-nodes"}[1m])) by (kubernetes_node,instance)` |
| node        | node_network_transmit_bytes_total      | `{{device}}` - Transmit: transmit rate of each network interface  | `sum(irate(node_network_transmit_bytes_total{job="kubernetes-nodes"}[1m])) by (kubernetes_node,instance)` |
| node        | node_network_transmit_packets_total    | `{{device}}` - Transmit: packets transmitted per second, per interface | `sum(irate(node_network_transmit_packets_total{job="kubernetes-nodes"}[1m])) by (kubernetes_node,instance)` |
| node        | node_sockstat_TCP_inuse                | TCP_inuse: number of TCP sockets in use (listening)               | `sum(irate(node_sockstat_TCP_inuse{job="kubernetes-nodes"}[1m])) by (kubernetes_node,instance)` |
| tidb        | connection_count      | Connection count                | `sum(tidb_server_connections) by (namespace,instance)` |
| tidb        | failed_query_ops      | Failed request count            | `sum(increase(tidb_server_execute_error_total[1m])) by (namespace,type,instance)` |
| tidb        | duration_99th         | 99th-percentile request latency | `histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket{sql_type!="internal"}[1m])) by (namespace,instance,le))` |
| tidb        | duration_95th         | 95th-percentile request latency | `histogram_quantile(0.95, sum(rate(tidb_server_handle_query_duration_seconds_bucket{sql_type!="internal"}[1m])) by (namespace,instance,le))` |
| tidb        | duration_avg          | Average request latency         | `sum(rate(tidb_server_handle_query_duration_seconds_sum{sql_type!="internal"}[2m])) by (namespace,instance) / sum(rate(tidb_server_handle_query_duration_seconds_count{sql_type!="internal"}[2m])) by (namespace,instance)` |
| tidb        | qps                   | Request count                   | `sum(rate(tidb_executor_statement_total[1m])) by (namespace,instance,type)` |
| tidb        | slow_query            | Slow queries                    | `histogram_quantile(0.90, sum(rate(tidb_server_slow_query_process_duration_seconds_bucket{sql_type="general"}[1m])) by (namespace,instance,le,sql_type))` |
| tidb        | block_cache_size      | Block Cache size                | `avg(tikv_engine_block_cache_size_bytes{db="kv"}) by (namespace,instance,cf)` |
| tidb        | uptime                | Service uptime                  | `max(time() - process_start_time_seconds{cluster="tidb",component="tidb"}) by (namespace,instance)` |
| tidb        | cpu_usage             | CPU utilization                 | `sum(rate(process_cpu_seconds_total{cluster="tidb",component="tidb"}[1m])) by (namespace,instance)` |
| tidb        | memory_usage          | Memory usage                    | `avg(process_resident_memory_bytes{cluster="tidb",component="tidb"}) by (namespace,instance)` |
| tidb        | statement_avg_queries      | Average queries per multi-statement request | `sum(rate(tidb_server_multi_query_num_sum[1m])) / sum(rate(tidb_server_multi_query_num_count[1m]))` |
| tidb        | transaction_retry          | Transaction retry count                     | `sum(rate(tidb_session_retry_num_sum[1m])) by (namespace,instance)` |
| tidb        | ddl_job_count              | DDL job count                               | `sum(tidb_ddl_running_job_count) by (namespace,instance,state)` |
| tidb        | top_sql_cpu                | TopSQL CPU consumption                      | `sum(rate(tidb_topsql_report_duration_seconds_sum[1m])) by (namespace,instance,sql_type)` |
| tidb        | server_is_up               | Number of live service nodes                | `count(up{cluster="tidb",component="tidb"} == 1) by (namespace)` |
| tikv        | store_size                 | Used storage capacity                       | `sum(tikv_store_size_bytes{type="used"}) by (namespace,instance)` |
| tikv        | available_size             | Available storage capacity                  | `sum(tikv_store_size_bytes{type="available"}) by (namespace,instance)` |
| tikv        | capacity_size              | Total storage capacity                      | `sum(tikv_store_size_bytes{type="capacity"}) by (namespace,instance)` |
| tikv        | cpu_usage                  | CPU utilization                             | `sum(rate(process_cpu_seconds_total{cluster="tidb",component="tikv"}[1m])) by (namespace,instance)` |
| tikv        | memory_usage               | Memory usage                                | `avg(process_resident_memory_bytes{cluster="tidb",component="tikv"}) by (namespace,instance)` |
| tikv        | io_util                    | IO utilization                              | `rate(node_disk_io_time_seconds_total[1m])` |
| tikv        | write_wal_mbps                       | Write rate (Mbps)               | `sum(rate(tikv_engine_flow_bytes{type="wal_file_bytes"}[1m])) by (namespace,instance)` |
| tikv        | read_mbps                            | Read rate (Mbps)                | `sum(rate(tikv_engine_flow_bytes{type=~"bytes_read\|iter_bytes_read"}[1m])) by (namespace,instance)` |
| tikv        | qps                                  | QPS                             | `sum(rate(tikv_storage_command_total[1m])) by (namespace,instance,type)` |
| tikv        | grpc_qps                             | QPS per gRPC request type       | `sum(rate(tikv_grpc_msg_duration_seconds_count{type=~"coprocessor\|kv_get\|kv_batch_get\|kv_prewrite\|kv_commit"}[1m])) by (namespace,instance,type)` |
| tikv        | threadpool_readpool_cpu              | StorageReadPool thread pool CPU | `sum(rate(tikv_thread_cpu_seconds_total{name=~"unified_read_pool_.*"}[1m])) by (namespace,instance,name)` |
| tikv        | raft_propose_wait                    | RaftPropose wait latency P99    | `histogram_quantile(0.99, sum(rate(tikv_raftstore_request_wait_time_duration_secs_bucket[1m])) by (le,namespace,instance))` |
| tikv        | raft_apply_wait                      | RaftApply wait latency P99      | `histogram_quantile(0.99, sum(rate(tikv_raftstore_apply_wait_time_duration_secs_bucket[1m])) by (le,namespace,instance))` |
| tikv        | region_pending                       | PendingRegion count             | `sum(tikv_raftstore_read_index_pending_duration_count) by (namespace,instance)` |
| tikv        | rocksdb_write_stall                  | RocksDB write stall count       | `sum(tikv_engine_write_stall) by (namespace,instance)` |
| tikv        | snapshot_apply_count                 | SnapshotApply count             | `sum(rate(tikv_raftstore_apply_duration_secs_count[1m])) by (namespace,instance)` |
| tikv        | server_is_up                         | Number of live service nodes    | `count(up{cluster="tidb",component="tikv"} == 1) by (namespace)` |
| tikv        | cpu_usage                            | CPU utilization                 | `sum(rate(process_cpu_seconds_total{cluster="tidb",component="tikv"}[1m])) by (namespace,instance)` |
| pd          | storage_capacity           | Total cluster capacity | `sum(pd_cluster_status{type="storage_capacity"}) by (namespace)` |
| pd          | storage_size               | Used capacity          | `sum(pd_cluster_status{type="storage_size"}) by (namespace)` |
| pd          | storage_used_ratio         | Used capacity ratio    | `sum(pd_cluster_status{type="storage_size"}) / sum(pd_cluster_status{type="storage_capacity"})` |
| pd          | store_up_count             | Healthy Store count    | `sum(pd_cluster_status{type="store_up_count"}) by (namespace)` |
| pd          | store_down_count           | Down Store count       | `sum(pd_cluster_status{type="store_down_count"}) by (namespace)` |
| pd          | store_unhealth_count       | Unhealth Store count   | `sum(pd_cluster_status{type="store_unhealth_count"}) by (namespace)` |
| pd          | store_low_space_count      | Low-space Store count  | `sum(pd_cluster_status{type="store_low_space_count"}) by (namespace)` |
| pd          | store_slow_count           | Slow Store count       | `sum(pd_cluster_status{type="store_slow_count"}) by (namespace)` |
| pd          | abnormal_region_count      | Abnormal Region count  | `sum(pd_regions_status{type=~"pending-peer-region-count\|down-peer-region-count\|offline-peer-region-count\|miss-peer-region-count\|learner-peer-region-count"}) by (namespace,type)` |
| pd          | region_health              | Region health status   | `max(pd_regions_status) by (namespace,type)` |
| pd          | leader_primary             | PDLeader-Primary node  | `service_member_role{cluster="tidb",component="pd",service="PD"}` |
| pd          | region_count               | Total Region count     | `sum(pd_cluster_status{type="region_count"}) by (namespace)` |
| pd          | leader_count               | Total Leader count     | `sum(pd_cluster_status{type="leader_count"}) by (namespace)` |
| pd          | learner_count              | Total Learner count    | `sum(pd_cluster_status{type="learner_count"}) by (namespace)` |
| pd          | witness_count              | Total Witness count    | `sum(pd_cluster_status{type="witness_count"}) by (namespace)` |
| pd          | cpu_usage                  | CPU utilization        | `sum(rate(process_cpu_seconds_total{cluster="tidb",component="pd"}[1m])) by (namespace,instance)` |
| pd          | memory_usage               | Memory usage           | `avg(process_resident_memory_bytes{cluster="tidb",component="pd"}) by (namespace,instance)` |
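
To try one of the expressions in the promql column against a live Prometheus server, a kpi_key can be turned into an instant-query URL for the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch rather than part of this project's tooling: the `KPI_PROMQL` mapping, the `build_query_url` helper, and the `http://prometheus.example:9090` address are all hypothetical names introduced for illustration, and only three rows of the table are copied in.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# A few kpi_key -> PromQL pairs copied from the table above.
# The mapping itself is an illustrative assumption, not a project artifact.
KPI_PROMQL = {
    "node_memory_MemTotal_bytes": 'node_memory_MemTotal_bytes{job="kubernetes-nodes"}',
    "connection_count": "sum(tidb_server_connections) by (namespace,instance)",
    "store_up_count": 'sum(pd_cluster_status{type="store_up_count"}) by (namespace)',
}


def build_query_url(base_url: str, kpi_key: str) -> str:
    """Return a Prometheus instant-query URL for the given kpi_key."""
    promql = KPI_PROMQL[kpi_key]  # raises KeyError for keys not in the mapping
    # urlencode percent-encodes braces, quotes, and spaces in the expression.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical server address; replace it with a real Prometheus endpoint.
url = build_query_url("http://prometheus.example:9090", "connection_count")
print(url)

# Round-trip check: decoding the URL recovers the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == KPI_PROMQL["connection_count"])
```

The sketch only builds the URL and does not send a request, so it runs without network access. Fetching the URL with any HTTP client should return the JSON result of the instant query.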

Field definitions (infra sample data file)