# Prometheus
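## 1 Overview
```sh
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. To emphasize this and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project, after Kubernetes (k8s).
Official site:
https://prometheus.io/
CNCF landscape:
https://landscape.cncf.io/
VM plan
root@prometheus-server31:~# tail -6 /etc/hosts
10.0.0.31 prometheus-server31
10.0.0.32 prometheus-server32
10.0.0.33 prometheus-server33
10.0.0.41 node-exporter41
10.0.0.42 node-exporter42
10.0.0.43 node-exporter43
Every VM has two NICs. The second NIC is on the 192.168.137 network, and the last octet of its IP address matches the one on the first NIC.
```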
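## 2 Architecture

```sh
1 prometheus server
Time-series data storage and management of monitoring metrics
2 prometheus web ui
Cluster status management, PromQL
3 jobs/exporters
Exporter: exposes metrics in the format Prometheus expects on behalf of the monitored client. An exporter runs as a daemon and collects data; it is itself an HTTP server that answers HTTP requests with metrics in key/value form.
Purpose: collect data from middleware
4 pushgateway
Service discovery: file, DNS, Kubernetes, Consul, custom integration, ...
5 alertmanager
Alerting
Prometheus consists of nine main components, with the following responsibilities:
- Prometheus Server:
Each server runs independently and relies only on its local storage for its core functions: scraping time-series data, evaluating rules, and alerting.
- Prometheus targets:
Statically configured targets whose data is collected.
- service discovery:
Dynamic service discovery.
- Client Library:
Client libraries generate metrics for the services to be monitored and expose them to the Prometheus Server.
When the Prometheus Server pulls, they return the current state of the metrics directly.
- Push Gateway:
Exporters already cover many cases, but a lot of custom monitoring data is still needed. Pushgateway accepts arbitrary custom metrics, so almost anything can be reported.
Developing an exporter requires a real programming language rather than a shell script, whereas pushing to Pushgateway is much easier.
Pushgateway is mainly for short-lived jobs. Because such jobs exist only briefly, they may disappear before Prometheus pulls from them. These jobs therefore push their metrics to Pushgateway, which Prometheus then scrapes. This approach is mainly for service-level metrics; for machine-level metrics, use node exporter.
- Exporters:
Deployed on the hosts of third-party software to expose the metrics of existing third-party services to Prometheus.
- Alertmanager:
After receiving alerts from the Prometheus Server, it deduplicates them, groups them, and routes them to the matching receiver so that notifications reach users efficiently. Common receivers include email, PagerDuty, OpsGenie, webhooks, and other tools.
- Data Visualization:
Prometheus Web UI (built into the Prometheus Server) and Grafana (a third-party visualization component that must be deployed separately).
- Service Discovery:
Dynamically discovers the targets to monitor and so completes the monitoring configuration. It is especially useful in containerized environments and is currently built into the Prometheus Server.
Most of the components above are written in Go, so they are easy to build and deploy as static binaries.
References:
https://prometheus.io/docs/introduction/overview/
https://github.com/prometheus/prometheus
```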
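The key/value output of an exporter can be illustrated with a small sketch. The sample text below is hypothetical: the values are made up, and `sample_metrics` and `metric_value` are illustrative helper names, not part of any Prometheus tooling. It only shows the general shape of the text an exporter returns and one simple way to read a value from it.

```sh
#!/bin/sh
# Hypothetical sample of the text an exporter might return from /metrics.
sample_metrics() {
  cat <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 56.7
EOF
}

# metric_value KEY: read exposition text on stdin and print the value of the
# first sample whose name-and-labels field is exactly KEY.
metric_value() {
  awk -v key="$1" '$1 == key { print $2; exit }'
}

sample_metrics | metric_value 'node_load1'
sample_metrics | metric_value 'node_cpu_seconds_total{cpu="0",mode="idle"}'
```

Against a live exporter the same helper could be fed from `curl -s http://10.0.0.41:9100/metrics`, assuming that host is reachable. The exact-match comparison is deliberately naive: it breaks if label values contain spaces or if the label order differs.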
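## 3 Deploying Prometheus from the binary release
```sh
wget https://github.com/prometheus/prometheus/releases/download/v2.53.3/prometheus-2.53.3.linux-amd64.tar.gz
mkdir -p /software
tar xf prometheus-2.53.3.linux-amd64.tar.gz -C /software
cd /software/prometheus-2.53.3.linux-amd64 && ./prometheus --config.file=prometheus.yml
```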
## 4 Deploying Prometheus with a script
```sh
1 Upload the tar.gz package
tar -tf install-prometheus-server-v2.53.3.tar.gz
./download/
./download/prometheus-2.53.3.linux-amd64.tar.gz
./install-prometheus-server.sh
2 Extract and install
./install-prometheus-server.sh i
3 Open the web UI
http://192.168.137.31:9090
4 To uninstall
./install-prometheus-server.sh r
5 Script contents
cat install-prometheus-server.sh
#!/bin/bash
VERSION=2.53.3
ARCH=amd64
SOFTWARE=prometheus-${VERSION}.linux-${ARCH}.tar.gz
URL=https://github.com/prometheus/prometheus/releases/download/v${VERSION}/${SOFTWARE}
DOWNLOAD=./download
INSTALLDIR=/software
BASEDIR=${INSTALLDIR}/prometheus-${VERSION}.linux-${ARCH}
DATADIR=/prometheus/data/prometheus
LOGDIR=/prometheus/logs/prometheus
HOSTIP=0.0.0.0
PORT=9090
HOSTNAME=$(hostname)
function prepare() {
# Create the directories if they do not exist
[ -d "$INSTALLDIR" ] || install -d "${INSTALLDIR}"
[ -d "$DOWNLOAD" ] || install -d "${DOWNLOAD}"
[ -d "$DATADIR" ] || install -d "${DATADIR}"
[ -d "$LOGDIR" ] || install -d "${LOGDIR}"
# Load $ID so the distribution check below works
. /etc/os-release
if [ "$ID" == "centos" ];then
# Install wget if it is missing
[ -f /usr/bin/wget ] || yum -y install wget
fi
# Download the package only if it is not already present
[ -s "${DOWNLOAD}/${SOFTWARE}" ] || wget "$URL" -O "${DOWNLOAD}/${SOFTWARE}"
}
function deploy() {
# Prepare the environment
prepare
# Unpack the package
tar xf "${DOWNLOAD}/${SOFTWARE}" -C "${INSTALLDIR}"
# Generate the systemd unit
cat > /etc/systemd/system/prometheus-server.service <
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
EOF
# Enable the service at boot and start it now
systemctl daemon-reload
systemctl enable --now prometheus-server
systemctl status prometheus-server
sleep 5
ss -ntl | grep ${PORT}
}
function delete(){
systemctl disable --now prometheus-server.service
# Remove the unit file created by deploy, plus the install, data and log directories
rm -rf /etc/systemd/system/prometheus-server.service "$BASEDIR" "$DATADIR" "$LOGDIR"
systemctl daemon-reload
}
function main() {
case $1 in
deploy|i)
deploy
echo "Script: prometheus-server on ${HOSTNAME} was deployed successfully! [successfully]"
;;
delete|r)
delete
echo "Script: prometheus-server on ${HOSTNAME} was uninstalled. See you next time~"
;;
*)
echo "Usage: $0 deploy[i]|delete[r]"
;;
esac
}
main "$1"
```

## 5 Deploying node-exporter with a script
```sh
1 Upload the tar.gz package
tar -tf install-node-exporter-v1.8.2.tar.gz
./download/
./download/node_exporter-1.8.2.linux-amd64.tar.gz
./install-node-exporter.sh
2 Extract and install
./install-node-exporter.sh i
3 Open the metrics page
http://192.168.137.41:9100/metrics
http://192.168.137.42:9100/metrics
http://192.168.137.43:9100/metrics
4 To uninstall
./install-node-exporter.sh r
5 Script contents
cat install-node-exporter.sh
#!/bin/bash
VERSION=1.8.2
SOFTWARE=node_exporter-${VERSION}.linux-amd64.tar.gz
URL=https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/${SOFTWARE}
DOWNLOAD=./download
INSTALLDIR=/software
BASEDIR=${INSTALLDIR}/node_exporter-${VERSION}.linux-amd64
HOST="0.0.0.0"
PORT=9100
hostname=$(hostname)
function prepare() {
# Create the directories if they do not exist
[ -d "$INSTALLDIR" ] || mkdir -pv "${INSTALLDIR}"
[ -d "$DOWNLOAD" ] || mkdir -pv "${DOWNLOAD}"
# Load $ID so the distribution check below works
. /etc/os-release
if [ "$ID" == "centos" ];then
# Install wget if it is missing
[ -f /usr/bin/wget ] || yum -y install wget
fi
# Download the package only if it is not already present
[ -s "${DOWNLOAD}/${SOFTWARE}" ] || wget "$URL" -O "${DOWNLOAD}/${SOFTWARE}"
}
function install() {
# Prepare the environment
prepare
# Unpack the package
tar xf "${DOWNLOAD}/${SOFTWARE}" -C "${INSTALLDIR}"
# Generate the systemd unit
cat > /etc/systemd/system/node-exporter.service <
Assume tcp_wait_conn is a custom KEY of ours.
The condition is true when the number of machines whose TCP wait count exceeds 500 meets the threshold.
count(rate(node_cpu_seconds_total{cpu="0",mode="idle"}[1m]))
Counts the number of series in the result.
7 Other functions
Recommended reading:
https://prometheus.io/docs/prometheus/latest/querying/functions/
- Example: monitoring CPU usage
1 CPU usage per node
1.1 First find the CPU-related KEY
node_cpu_seconds_total
1.2 Filter the CPU idle time ({mode='idle'}) and the time in all CPU states ('{}')
node_cpu_seconds_total{mode='idle'}
Selects the CPU idle time.
node_cpu_seconds_total{}
The '{}' can be omitted here because it contains no matchers; the query returns the time spent in every CPU state.
1.3 Increase of CPU time over 1 minute
increase(node_cpu_seconds_total{mode='idle'}[1m])
Increase of CPU idle time over 1 minute.
increase(node_cpu_seconds_total[1m])
Increase of CPU time in all states over 1 minute.
1.4 Sum the results
sum(increase(node_cpu_seconds_total{mode='idle'}[1m]))
Sums the 1-minute increase of idle time across all CPUs.
sum(increase(node_cpu_seconds_total[1m]))
Sums the 1-minute increase of time in all states across all CPUs.
1.5 Group by node
sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance)
Sums the 1-minute increase of idle time across all CPUs, grouped by machine instance.
sum(increase(node_cpu_seconds_total[1m])) by (instance)
Sums the 1-minute increase of time in all states across all CPUs, grouped by machine instance.
1.6 Fraction of CPU time spent idle over 1 minute
sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)
1.7 CPU usage over 1 minute, formula: (1 - idle fraction) * 100%.
(1 - sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
1.8 CPU usage over 1 hour, formula: (1 - idle fraction) * 100%.
(1 - sum(increase(node_cpu_seconds_total{mode='idle'}[1h])) by (instance) / sum(increase(node_cpu_seconds_total[1h])) by (instance)) * 100
2 Percentage of CPU time in user mode over 1 minute
sum(increase(node_cpu_seconds_total{mode='user'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance) * 100
3 Percentage of CPU time in kernel (system) mode over 1 minute
(sum(increase(node_cpu_seconds_total{mode='system'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
4 Percentage of CPU time in IO wait over 1 minute
(sum(increase(node_cpu_seconds_total{mode='iowait'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
5 Check the CPU with the top command
top
```
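The arithmetic behind the usage formula (1 - idle fraction) * 100 can be checked by hand with a small sketch. The function name `cpu_usage_percent` and the sample numbers are illustrative assumptions, not output from a real server: a 4-core machine accumulates about 240 CPU-seconds per minute across all modes, and here 228 of them are assumed to be idle.

```sh
#!/bin/sh
# cpu_usage_percent IDLE_INCREASE TOTAL_INCREASE
# Mirrors (1 - idle / total) * 100 from the PromQL above, where both arguments
# are increases of node_cpu_seconds_total over the same window and instance.
cpu_usage_percent() {
  awk -v idle="$1" -v total="$2" 'BEGIN {
    # Guard against a zero or negative denominator (e.g. no samples in the window).
    if (total <= 0) { print "NaN"; exit }
    printf "%.2f\n", (1 - idle / total) * 100
  }'
}

# 4 cores * 60 s = 240 CPU-seconds in total, 228 of them idle.
cpu_usage_percent 228 240
```

With these inputs the idle fraction is 0.95, so the sketch prints 5.00. The same function applies to the user, system and iowait variants if the first argument is swapped for "total minus that mode".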

## 9 Using MySQL as Grafana's data store and adding Prometheus
```sh
1 Check the official site
https://grafana.com/grafana/download/9.5.21?pg=graf&plcmt=deploy-box-1
2 Install the dependencies and Grafana
sudo apt-get install -y adduser libfontconfig1 musl
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.5.21_amd64.deb
sudo dpkg -i grafana-enterprise_9.5.21_amd64.deb
3 Deploy MySQL with Docker
docker load < mysql-v8.0.36-oracle.tar.gz
fc037c17567d: Loading layer 118.8MB/118.8MB
152c1ecea280: Loading layer 11.26kB/11.26kB
fb5c92e924ab: Loading layer 2.359MB/2.359MB
5b76076a2dd4: Loading layer 13.86MB/13.86MB
a6909c467615: Loading layer 6.656kB/6.656kB
eaa1e85de732: Loading layer 3.072kB/3.072kB
9513d2aedd12: Loading layer 185.6MB/185.6MB
84d659420bad: Loading layer 3.072kB/3.072kB
876b8cd855eb: Loading layer 298.7MB/298.7MB
1c0ff7ed67c4: Loading layer 16.9kB/16.9kB
318dde184d61: Loading layer 1.536kB/1.536kB
Loaded image: mysql:8.0.36-oracle
4 Start MySQL
docker run -d --name mysql-server --restart always --network host -e MYSQL_ALLOW_EMPTY_PASSWORD=yes -e MYSQL_DATABASE=prometheus -e MYSQL_USER=ysl -e MYSQL_PASSWORD=123456 mysql:8.0.36-oracle --character-set-server=utf8 --collation-server=utf8_bin --default-authentication-plugin=mysql_native_password
5 Edit the Grafana configuration file
vim /etc/grafana/grafana.ini
...
[database]
type = mysql
host = 10.0.0.43:3306
name = prometheus
user = ysl
password = 123456
6 Start Grafana
systemctl restart grafana-server.service
ss -ntl | grep 3000
LISTEN 0 4096 *:3000 *:*
7 Log in to the Grafana web UI
http://10.0.0.31:3000/
```

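## 10 Using Grafana
### 1 Add Prometheus




### 2 Add a dashboard





### 3 Continue adding to the dashboard

### 4 Save the dashboard



### 5 Reopen Grafana and view the saved dashboard

### 6 Add a row
Panels placed in a row can be collapsed together



### 7 Defining a table in Grafana
#### 1 Add a visualization

#### 2 Query server information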
node_boot_time_seconds
avg(node_uname_info) by (instance,nodename,release)



```sh
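Open Transform and choose Filter by name. The fields listed correspond to the column headers shown in the table; the default headers are not user-friendly and can be renamed.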
```






```sh
The instance column header now shows the new name. Keep adding and renaming columns the same way.
```

#### 3 Duplicate the table

#### 4 Merge



#### 5 Rename the panel



```sh
Reference:
https://www.cnblogs.com/yinzhengjie/p/18538430
```
Using the same method, add CPU usage: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance)) * 100
Total memory: node_memory_MemTotal_bytes-0
When adding the memory column, note the following, as illustrated below:



Then compute the memory usage percentage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))*100
NIC outbound traffic: max(rate(node_network_transmit_bytes_total[1m])) by (instance)

#### 6 Configure thresholds and colors



```sh
As a test, run a stress test on node-exporter41 and check whether the cell turns red
stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 1m
```

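## 8 Grafana custom variables




Set the metric to {job="node-exporter"}
node-exporter is the job name defined in the Prometheus configuration



Edit the previously configured dashboard and add the variable to its existing queries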

```sh
Percentage of CPU time in IO wait over 1 minute
(sum(increase(node_cpu_seconds_total{mode='iowait',instance="$host"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
```

```sh
CPU usage over 1 minute
(1 - sum(increase(node_cpu_seconds_total{mode='idle',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
```

```sh
Percentage of CPU time in IO wait over 1 minute
(sum(increase(node_cpu_seconds_total{mode='iowait',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
```

```sh
Percentage of CPU time in kernel (system) mode over 1 minute
(sum(increase(node_cpu_seconds_total{mode='system',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
```

Finally, pick a host from the variable drop-down to view its panels
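What the `$host` variable does to a query can be sketched as plain text substitution. This is only an illustration of the idea, not how Grafana is implemented, and `render_query` is a hypothetical helper name. It assumes the chosen value contains no `/` or `&` characters, which holds for host:port values such as 10.0.0.41:9100.

```sh
#!/bin/sh
# render_query QUERY HOST
# Replace every literal $host in QUERY with HOST, roughly what happens to a
# panel query before it is sent to Prometheus.
render_query() {
  printf '%s\n' "$1" | sed 's/[$]host/'"$2"'/g'
}

# Single quotes keep the shell from expanding $host itself.
render_query 'node_cpu_seconds_total{mode="idle",instance="$host"}' '10.0.0.41:9100'
```

The sketch prints the query with `instance="10.0.0.41:9100"`, which is the selector Prometheus would actually evaluate for that choice of host.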

## 9 Grafana dashboard backup and restore
### 1 Backup
#### 1 Download the JSON backup


#### 2 Import the backup


### 2 Restore
#### 1 Delete


#### 2 Restore




```sh
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 3,
"links": [],
"liveNow": false,
"panels": [
{
"collapsed": true,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 8,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"custom": {
"align": "auto",
"cellOptions": {
"type": "auto"
},
"inspect": false
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green"
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "instance"
},
"properties": [
{
"id": "displayName",
"value": "服务器IP"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Time"
},
"properties": [
{
"id": "displayName",
"value": "开机时间"
}
]
},
{
"matcher": {
"id": "byName",
"options": "nodename"
},
"properties": [
{
"id": "displayName",
"value": "主机名"
}
]
},
{
"matcher": {
"id": "byName",
"options": "release"
},
"properties": [
{
"id": "displayName",
"value": "内核版本"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Value #load"
},
"properties": [
{
"id": "displayName",
"value": "负载"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Value #core"
},
"properties": [
{
"id": "displayName",
"value": "CPU使用率"
},
{
"id": "custom.cellOptions",
"value": {
"type": "color-background"
}
}
]
},
{
"matcher": {
"id": "byName",
"options": "Value #Memory"
},
"properties": [
{
"id": "unit",
"value": "bytes"
},
{
"id": "displayName",
"value": "内存总量"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Value #Memory Used"
},
"properties": [
{
"id": "displayName",
"value": "内存使用率"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Value #NetWork"
},
"properties": [
{
"id": "displayName",
"value": "网卡流量"
},
{
"id": "unit",
"value": "bytes"
}
]
}
]
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 1
},
"id": 7,
"options": {
"cellHeight": "sm",
"footer": {
"countRows": false,
"fields": "",
"reducer": [
"sum"
],
"show": false
},
"showHeader": true
},
"pluginVersion": "9.5.21",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"exemplar": false,
"expr": "avg(node_uname_info) by (instance,nodename,release)",
"format": "table",
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "kernel"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"exemplar": false,
"expr": "node_load5-0",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "load"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"exemplar": false,
"expr": "(1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[1m])) by (instance)) * 100",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "core"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"exemplar": false,
"expr": "node_memory_MemTotal_bytes-0",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "Memory"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"exemplar": false,
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))*100",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "Memory Used"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"exemplar": false,
"expr": "max(rate(node_network_transmit_bytes_total[1m])) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "NetWork"
}
],
"title": "服务器集群概览",
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": [
"Time",
"instance",
"nodename",
"release",
"Value #load",
"Value #core",
"Value #Memory",
"Value #Memory Used",
"Value #NetWork"
]
}
}
},
{
"id": "merge",
"options": {}
}
],
"transparent": true,
"type": "table"
}
],
"title": "Overview",
"type": "row"
},
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 1
},
"id": 6,
"panels": [],
"title": "CPU监控",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": [
{
"__systemRef": "hideSeriesFrom",
"matcher": {
"id": "byNames",
"options": {
"mode": "exclude",
"names": [
"10.0.0.41:9100"
],
"prefix": "All except:",
"readOnly": true
}
},
"properties": [
{
"id": "custom.hideFrom",
"value": {
"legend": false,
"tooltip": false,
"viz": true
}
}
]
}
]
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 2
},
"id": 5,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"expr": "(sum(increase(node_cpu_seconds_total{mode='iowait',instance=\"$host\"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100\r\n",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "计算CPU IO等待时间的1分钟内百分比",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 2
},
"id": 1,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"expr": "(1 - sum(increase(node_cpu_seconds_total{mode='idle',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100\r\n",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "统计一分钟内cpu的使用率",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 10
},
"id": 4,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"expr": "(sum(increase(node_cpu_seconds_total{mode='iowait',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100\r\n",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "计算CPU IO等待时间的1分钟内百分比",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 10
},
"id": 3,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"editorMode": "code",
"expr": "(sum(increase(node_cpu_seconds_total{mode='system',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100\r\n",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "计算CPU内核态的1分钟内百分比",
"transparent": true,
"type": "timeseries"
}
],
"refresh": "",
"schemaVersion": 38,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"current": {
"selected": false,
"text": "10.0.0.41:9100",
"value": "10.0.0.41:9100"
},
"datasource": {
"type": "prometheus",
"uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f"
},
"definition": "label_values({job=\"node-exporter\"},instance)",
"description": "需要查看的具体主机的IP地址",
"hide": 0,
"includeAll": false,
"label": "选择查询节点",
"multi": false,
"name": "host",
"options": [],
"query": {
"query": "label_values({job=\"node-exporter\"},instance)",
"refId": "PrometheusVariableQueryEditor-VariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"type": "query"
}
]
},
"time": {
"from": "2025-03-21T02:43:07.257Z",
"to": "2025-03-21T02:46:44.145Z"
},
"timepicker": {},
"timezone": "",
"title": "自定义dashboard-cpu",
"uid": "f0b29f48-f63b-422f-a9b7-85702ad5a6c0",
"version": 4,
"weekStart": ""
}
```
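Besides the UI export, a dashboard can also be fetched over Grafana's HTTP API, which serves a dashboard by UID at `/api/dashboards/uid/<uid>`. The sketch below is a hedged illustration: `dashboard_api_url` and `backup_dashboard` are hypothetical helper names, and the bearer token is an assumption (an API key or service account token would have to be created first). The API response wraps the dashboard in an object with `meta` and `dashboard` keys, so it is not byte-for-byte the same as the UI export above.

```sh
#!/bin/sh
# dashboard_api_url BASE_URL UID
# Build the API endpoint for one dashboard; a trailing slash on BASE_URL is dropped.
dashboard_api_url() {
  printf '%s/api/dashboards/uid/%s\n' "${1%/}" "$2"
}

# backup_dashboard BASE_URL UID TOKEN OUTFILE
# Download the dashboard JSON to OUTFILE. TOKEN is an assumed Grafana API token.
backup_dashboard() {
  curl -fsS -H "Authorization: Bearer $3" "$(dashboard_api_url "$1" "$2")" -o "$4"
}

# Only prints the URL; nothing is contacted here.
dashboard_api_url http://10.0.0.31:3000/ f0b29f48-f63b-422f-a9b7-85702ad5a6c0
```

A call such as `backup_dashboard http://10.0.0.31:3000 <uid> <token> dashboard-backup.json` would then write the response to a file, assuming Grafana is reachable and the token is valid.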
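## 10 Importing a third-party Node Exporter dashboard


Find the template you want on the official Grafana dashboards site and note its ID




You can open each panel to inspect its PromQL and borrow from it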
## 11 node exporter
```sh
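The complete node exporter workflow
1 Main purpose: collect Linux metrics such as memory, disk and network, and expose them over HTTP on port 9100
2 Prometheus scrapes the node exporter metrics and stores them locally; its configuration file points at the node exporters
3 Grafana is configured with a data source that points at Prometheus and displays the data
4 Grafana defines variables, dashboards, rows, and so on
5 Import a template from the official site and adapt it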
```
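Step 2 of the workflow refers to a scrape job in prometheus.yml. The sketch below prints such a job for a list of targets; it is an assumption about what the configuration might look like, not a copy of the author's file. `node_exporter_job` is a hypothetical helper name, and the job name `node-exporter` is taken from the dashboard variable query used earlier.

```sh
#!/bin/sh
# node_exporter_job TARGET...
# Print a scrape job, indented to sit under scrape_configs: in prometheus.yml,
# with one static target per host:port argument.
node_exporter_job() {
  printf '  - job_name: "node-exporter"\n'
  printf '    static_configs:\n'
  printf '      - targets:\n'
  for target in "$@"; do
    printf '          - "%s"\n' "$target"
  done
}

node_exporter_job 10.0.0.41:9100 10.0.0.42:9100 10.0.0.43:9100
```

The printed fragment would be pasted under the existing `scrape_configs:` key; appending it blindly to the file is not recommended because YAML indentation has to line up with the entries already there.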
## 12 Monitoring with Prometheus
### 1 Monitoring Windows
```sh
1 Download the installer from the official site
2 Run it on Windows
```

```sh
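3 Edit the Prometheus configuration file
vim /software/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: "windows-exporter"
    static_configs:
      - targets: ["192.168.137.254:9182"]
4 Hot-reload Prometheus (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST 10.0.0.31:9090/-/reload
5 Check the new target in Prometheus
6 Import the Windows dashboard template from the official Grafana site, ID:
20763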
```
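A broken prometheus.yml makes the reload fail, so it can help to validate the file before triggering the reload. The following is a minimal sketch, assuming `promtool` (shipped in the same release tarball as `prometheus`) is on the PATH and that the server was started with `--web.enable-lifecycle`. `reload_prometheus` is a hypothetical helper name.

```sh
#!/bin/sh
# reload_prometheus CONFIG_FILE SERVER
# Validate CONFIG_FILE with promtool and, only if it passes, ask the server
# at SERVER (host:port) to reload its configuration.
reload_prometheus() {
  promtool check config "$1" || return 1
  curl -fsS -X POST "http://$2/-/reload"
}
```

For the setup in this section the call would be `reload_prometheus /software/prometheus-2.53.3.linux-amd64/prometheus.yml 10.0.0.31:9090`. If promtool is not on the PATH, its full path inside the extracted release directory can be substituted.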


### 2 Monitoring a ZooKeeper cluster
1 Enable metrics in ZooKeeper
```sh
Run the following command on every ZooKeeper node
cat >>/software/zookeeper/conf/zoo.cfg<
Query OK, 0 rows affected (0.08 sec)
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
Query OK, 0 rows affected (0.01 sec)
5 Start mysqld_exporter
root@node-exporter43:~# cat >>/etc/.my.cnf<
- Reference to the metric sample value: {{ $value }}
For better presentation, some knowledge of HTML is needed. Reference:
https://www.w3school.com.cn/html/index.asp
2 Example of a custom alert template on the alertmanager node
2.1 Custom email template
[root@prometheus-server32]# cat >/software/alertmanager-0.27.0.linux-amd64/email.tmpl<
Alert
Instance
Alert threshold
Start time
{{ range \$i, \$alert := .Alerts }}
{{ index \$alert.Labels "alertname" }}
{{ index \$alert.Labels "instance" }}
{{ index \$alert.Annotations "value" }}
{{ \$alert.StartsAt }}
{{ end }}
{{ end }}
EOF
2.2 Reference the custom template file from alertmanager
[root@prometheus-server32 alertmanager-0.27.0.linux-amd64]# cat >>/software/alertmanager-0.27.0.linux-amd64/alertmanager.yml <