# Prometheus

## 1 Overview

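```sh
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. To emphasize this, and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project, after Kubernetes.

The first hosted project was Kubernetes (k8s).
Official site:
https://prometheus.io/

CNCF landscape:
https://landscape.cncf.io/

VM plan
root@prometheus-server31:~# tail -6 /etc/hosts
10.0.0.31 prometheus-server31
10.0.0.32 prometheus-server32
10.0.0.33 prometheus-server33
10.0.0.41 node-exporter41
10.0.0.42 node-exporter42
10.0.0.43 node-exporter43

Every VM has two NICs. The second NIC is on the 192.168.137.x network and uses the same last octet as the first.
```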

## 2 Architecture

![image-20250320204322088](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320204322088.png)

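```sh
1 Prometheus server
Stores time-series data and manages monitoring metrics.

2 Prometheus web UI
Shows cluster status and runs PromQL queries.

3 Jobs/exporters
Exporter: exposes metrics in the Prometheus format for the client it runs on. An exporter runs as a daemon and collects data; it is also an HTTP server that answers HTTP requests with key/value metrics.
Purpose: collect data from middleware.

4 Pushgateway
Receives metrics pushed by short-lived jobs so that Prometheus can scrape them.
Service discovery (a separate part of the server): file, DNS, Kubernetes, Consul, custom integration, ...

5 Alertmanager
Alerting.

Prometheus consists of the following main components:
- Prometheus Server:
Each server runs independently of the others and relies only on its local storage for its core functions: scraping time-series data, evaluating rules, and generating alerts.

- Prometheus targets:
Statically configured target services to scrape.

- Service Discovery:
Dynamically discovers the targets to monitor, which makes it a key part of the monitoring configuration and especially useful in containerized environments. It is currently built into the Prometheus server.

- Client Library:
Client libraries generate metrics for the services being monitored and expose them to the Prometheus server.
When the Prometheus server pulls, they return the current state of the metrics directly.

- Push Gateway:
Exporters already cover many cases, but plenty of custom monitoring data is still needed. The Pushgateway makes it easy and flexible to publish custom metrics.
Writing an exporter generally means using a full programming language rather than a shell script, whereas pushing to the Pushgateway is much simpler.
The Pushgateway is mainly for short-lived jobs, which may exit before Prometheus gets a chance to pull from them. Such jobs push their metrics to the Pushgateway, and Prometheus scrapes the Pushgateway. This approach is mainly for service-level metrics; for machine-level metrics, use node exporter.

- Exporters:
Deployed on the hosts that run third-party software, to expose that software's existing metrics to Prometheus.

- Alertmanager:
After receiving alerts from the Prometheus server, it deduplicates and groups them and routes them to the matching receiver so that notifications reach users efficiently. Common receivers include email, PagerDuty, OpsGenie, and webhooks.

- Data Visualization:
Prometheus Web UI (built into the Prometheus server) and Grafana (a third-party visualization tool that must be deployed separately).

Most of these components are written in Go, so they are easy to build and deploy as binaries.
References:
https://prometheus.io/docs/introduction/overview/
https://github.com/prometheus/prometheus

```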
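The push model described for the Pushgateway can be sketched in a few lines of shell. This is a minimal sketch, not taken from these notes: `render_metric` is a hypothetical helper, and the Pushgateway address `10.0.0.31:9091` and the job name `backup_job` are assumptions chosen for illustration. The commented `curl` call uses the Pushgateway's `/metrics/job/<job>` push endpoint.

```sh
# Render one gauge sample in the Prometheus text exposition format.
# render_metric is a hypothetical helper, not part of any Prometheus tool.
render_metric() {
    local name=$1 value=$2 help=$3
    printf '# HELP %s %s\n' "$name" "$help"
    printf '# TYPE %s gauge\n' "$name"
    printf '%s %s\n' "$name" "$value"
}

# Push the sample (address and job name are assumptions):
# render_metric backup_duration_seconds 42 "Duration of the last backup" \
#   | curl --data-binary @- http://10.0.0.31:9091/metrics/job/backup_job
```

Because the payload is plain text, a short-lived cron job can report its result this way without embedding a client library.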

## 3 Deploying Prometheus from the binary release

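```sh
# Download the release tarball
wget https://github.com/prometheus/prometheus/releases/download/v2.53.3/prometheus-2.53.3.linux-amd64.tar.gz

# Unpack it under /software (create the directory first)
mkdir -p /software
tar xf prometheus-2.53.3.linux-amd64.tar.gz -C /software

# Start Prometheus in the foreground with the bundled prometheus.yml
cd /software/prometheus-2.53.3.linux-amd64
./prometheus
```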
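Before unpacking a downloaded tarball it is worth checking its SHA-256 digest. The following is a minimal sketch under stated assumptions: `verify_sha256` is a hypothetical helper, and the expected digest has to be copied by hand from the project's release page, so it is left as a placeholder here.

```sh
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only when the file's SHA-256 digest equals the expected value.
# verify_sha256 is a hypothetical helper name used for this sketch.
verify_sha256() {
    local file=$1 expected=$2 actual
    actual=$(sha256sum "$file" 2>/dev/null | awk '{print $1}')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage (the digest is a placeholder, not a real value):
# verify_sha256 prometheus-2.53.3.linux-amd64.tar.gz "<digest from the release page>" \
#   && tar xf prometheus-2.53.3.linux-amd64.tar.gz -C /software
```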

## 4 Deploying Prometheus with a script

```sh
1 Upload the tar.gz bundle
tar -tf install-prometheus-server-v2.53.3.tar.gz
./download/
./download/prometheus-2.53.3.linux-amd64.tar.gz
./install-prometheus-server.sh

2 Unpack and install
./install-prometheus-server.sh i

3 Open the web UI
http://192.168.137.31:9090

4 To uninstall
./install-prometheus-server.sh r

5 Script contents
cat install-prometheus-server.sh
#!/bin/bash

VERSION=2.53.3
ARCH=amd64
SOFTWARE=prometheus-${VERSION}.linux-${ARCH}.tar.gz
URL=https://github.com/prometheus/prometheus/releases/download/v${VERSION}/${SOFTWARE}
DOWNLOAD=./download
INSTALLDIR=/software
BASEDIR=${INSTALLDIR}/prometheus-${VERSION}.linux-${ARCH}
DATADIR=/prometheus/data/prometheus
LOGDIR=/prometheus/logs/prometheus
HOSTIP=0.0.0.0
PORT=9090
HOSTNAME=$(hostname)

function prepare() {
    # Create the directories if they do not exist
    [ -d "$INSTALLDIR" ] || install -d "${INSTALLDIR}"
    [ -d "$DOWNLOAD" ] || install -d "${DOWNLOAD}"
    [ -d "$DATADIR" ] || install -d "${DATADIR}"
    [ -d "$LOGDIR" ] || install -d "${LOGDIR}"

    . /etc/os-release

    if [ "$ID" == "centos" ]; then
        # Install wget if it is missing
        [ -f /usr/bin/wget ] || yum -y install wget
    fi

    # Download the tarball unless a non-empty copy is already present
    [ -s "${DOWNLOAD}/${SOFTWARE}" ] || wget "$URL" -O "${DOWNLOAD}/${SOFTWARE}"
}

function deploy() {
    # Check the environment
    prepare

    # Unpack the tarball
    tar xf "${DOWNLOAD}/${SOFTWARE}" -C "${INSTALLDIR}"

    # Generate the systemd unit.
    # The unit header and the start of ExecStart were lost in these notes;
    # the lines from [Unit] through the prometheus flags are a reconstruction
    # based on the variables defined above.
    cat > /etc/systemd/system/prometheus-server.service <<EOF
[Unit]
Description=Prometheus Server
After=network.target

[Service]
Restart=on-failure
ExecStart=/bin/bash -c "${BASEDIR}/prometheus --config.file=${BASEDIR}/prometheus.yml --storage.tsdb.path=${DATADIR} --web.listen-address=${HOSTIP}:${PORT} --web.enable-lifecycle &>> ${LOGDIR}/prometheus-server.log"
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65535

[Install]
WantedBy=multi-user.target
EOF

    # Enable the service at boot and start it now
    systemctl daemon-reload
    systemctl enable --now prometheus-server
    systemctl status prometheus-server
    sleep 5
    ss -ntl | grep ${PORT}
}

function delete() {
    systemctl disable --now prometheus-server.service
    # Remove this service's own unit file (the original removed node-exporter.service by mistake)
    rm -rf /etc/systemd/system/prometheus-server.service "$BASEDIR" "$DATADIR" "$LOGDIR"
    systemctl daemon-reload
}

function main() {
    case $1 in
        deploy|i)
            deploy
            echo "Script: prometheus-server on ${HOSTNAME} was deployed successfully"
            ;;
        delete|r)
            delete
            echo "Script: prometheus-server on ${HOSTNAME} was removed"
            ;;
        *)
            echo "Usage: $0 deploy[i]|delete[r]"
            ;;
    esac
}

main "$1"
```

![image-20250320212744495](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320212744495.png)
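After the service starts, the web UI check can be automated against Prometheus's management endpoints (`/-/ready` and `/-/healthy`). This is a minimal sketch: `prom_endpoint` and `wait_ready` are hypothetical helper names, and the retry count of 10 is an arbitrary assumption.

```sh
# Build the URL of a Prometheus management endpoint.
# prom_endpoint and wait_ready are hypothetical helper names.
prom_endpoint() {
    local host=$1 port=${2:-9090} path=${3:-/-/ready}
    printf 'http://%s:%s%s\n' "$host" "$port" "$path"
}

# Poll the readiness endpoint once per second until it answers with
# a success status or the attempts run out.
wait_ready() {
    local url tries=${3:-10}
    url=$(prom_endpoint "$1" "${2:-9090}")
    while [ "$tries" -gt 0 ]; do
        curl -fsS -o /dev/null "$url" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example: wait_ready 10.0.0.31 9090 && echo "Prometheus is ready"
```

Polling the endpoint is more reliable than the fixed `sleep 5` followed by `ss`, because a listening port does not guarantee that the server has finished loading its data.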

## 5 Deploying node-exporter with a script

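```sh
1 Upload the tar.gz bundle
tar -tf install-node-exporter-v1.8.2.tar.gz
./download/
./download/node_exporter-1.8.2.linux-amd64.tar.gz
./install-node-exporter.sh

2 Unpack and install
./install-node-exporter.sh i

3 Open the metrics page
http://192.168.137.41:9100/metrics
http://192.168.137.42:9100/metrics
http://192.168.137.43:9100/metrics

4 To uninstall
./install-node-exporter.sh r

5 Script contents
cat install-node-exporter.sh
#!/bin/bash
VERSION=1.8.2
SOFTWARE=node_exporter-${VERSION}.linux-amd64.tar.gz
URL=https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/${SOFTWARE}
DOWNLOAD=./download
INSTALLDIR=/software
BASEDIR=${INSTALLDIR}/node_exporter-${VERSION}.linux-amd64
HOST="0.0.0.0"
PORT=9100
hostname=$(hostname)

function prepare() {
    # Create the directories if they do not exist
    [ -d "$INSTALLDIR" ] || mkdir -pv "${INSTALLDIR}"
    [ -d "$DOWNLOAD" ] || mkdir -pv "${DOWNLOAD}"

    # Load $ID (the original script tested $ID without sourcing this file)
    . /etc/os-release
    if [ "$ID" == "centos" ]; then
        # Install wget if it is missing
        [ -f /usr/bin/wget ] || yum -y install wget
    fi

    # Download the tarball unless a non-empty copy is already present
    [ -s "${DOWNLOAD}/${SOFTWARE}" ] || wget "$URL" -O "${DOWNLOAD}/${SOFTWARE}"
}

function install() {
    # Check the environment
    prepare

    # Unpack the tarball
    tar xf "${DOWNLOAD}/${SOFTWARE}" -C "${INSTALLDIR}"

    # Generate the systemd unit.
    # Everything from this heredoc to the end of the script was lost in these
    # notes; it is reconstructed by analogy with install-prometheus-server.sh.
    cat > /etc/systemd/system/node-exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
Restart=on-failure
ExecStart=${BASEDIR}/node_exporter --web.listen-address=${HOST}:${PORT}

[Install]
WantedBy=multi-user.target
EOF

    systemctl daemon-reload
    systemctl enable --now node-exporter
    ss -ntl | grep ${PORT}
}

function remove() {
    systemctl disable --now node-exporter.service
    rm -rf /etc/systemd/system/node-exporter.service "$BASEDIR"
    systemctl daemon-reload
}

case $1 in
    install|i) install ;;
    remove|r)  remove ;;
    *) echo "Usage: $0 install[i]|remove[r]" ;;
esac
```

```sh
count(tcp_wait_conn > 500)
Assume tcp_wait_conn is a custom metric of ours.
The comparison keeps only the machines whose TCP wait count is above 500, and count() returns how many such machines there are. The result is empty when there are none.
```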

```sh
count(rate(node_cpu_seconds_total{cpu="0",mode="idle"}[1m]))
Counts the number of series in the result.

7 Other functions
Recommended reading:
https://prometheus.io/docs/prometheus/latest/querying/functions/

- Example: monitoring CPU usage
1 CPU usage per node
1.1 First find the CPU-related metric
node_cpu_seconds_total

1.2 Filter the idle time ({mode='idle'}) and the time across all modes ('{}')
node_cpu_seconds_total{mode='idle'}
Selects the CPU idle time.
node_cpu_seconds_total{}
The '{}' can be omitted because it contains no matchers; this selects the CPU time in every mode.

1.3 Increase in CPU time over 1 minute
increase(node_cpu_seconds_total{mode='idle'}[1m])
Increase of the idle time over 1 minute.
increase(node_cpu_seconds_total[1m])
Increase of the time in every mode over 1 minute.

1.4 Sum the results
sum(increase(node_cpu_seconds_total{mode='idle'}[1m]))
Sums the 1-minute idle-time increase across all CPUs.
sum(increase(node_cpu_seconds_total[1m]))
Sums the 1-minute increase across all CPUs and all modes.

1.5 Group by node
sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance)
Sums the 1-minute idle-time increase across all CPUs, grouped by machine instance.
sum(increase(node_cpu_seconds_total[1m])) by (instance)
Sums the 1-minute increase across all CPUs and all modes, grouped by machine instance.

1.6 Fraction of CPU time spent idle over 1 minute
sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)

1.7 CPU usage over 1 minute. Formula: (1 - idle fraction) * 100%.
(1 - sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100

1.8 CPU usage over 1 hour. Formula: (1 - idle fraction) * 100%.
(1 - sum(increase(node_cpu_seconds_total{mode='idle'}[1h])) by (instance) / sum(increase(node_cpu_seconds_total[1h])) by (instance)) * 100

2 Percentage of CPU time in user mode over 1 minute
sum(increase(node_cpu_seconds_total{mode='user'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance) * 100

3 Percentage of CPU time in kernel (system) mode over 1 minute
(sum(increase(node_cpu_seconds_total{mode='system'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100

4 Percentage of CPU time in I/O wait over 1 minute
(sum(increase(node_cpu_seconds_total{mode='iowait'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100

5 Cross-check CPU usage with top
top
```
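The arithmetic behind the usage formula in step 1.7 can be checked by hand. The sketch below mirrors `(1 - idle / total) * 100` for a single node; `cpu_usage_pct` is a hypothetical helper, and the sample numbers are made up for illustration rather than measured.

```sh
# cpu_usage_pct IDLE_DELTA TOTAL_DELTA
# Both arguments are increases in CPU seconds over the same window.
# cpu_usage_pct is a hypothetical helper written for this sketch.
cpu_usage_pct() {
    awk -v idle="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "NaN"; exit }
        printf "%.2f\n", (1 - idle / total) * 100
    }'
}

# A 2-core node accumulates about 120 CPU seconds per minute across all
# modes; if 108 of those seconds were idle, the node was 10% busy.
cpu_usage_pct 108 120   # prints 10.00
```

The total grows with the number of cores, which is why the PromQL version divides two sums grouped by `instance` instead of dividing by the window length.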

![image-20250320220120547](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320220120547.png)
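The per-mode percentage queries above differ only in the `mode` label, so they can be generated and sent to the Prometheus HTTP API from a script. This is a minimal sketch: `cpu_mode_query` is a hypothetical helper, and the server address `10.0.0.31:9090` is taken from the VM plan in these notes.

```sh
# Print the PromQL expression for one CPU mode's share of CPU time.
# cpu_mode_query is a hypothetical helper; the window defaults to 1m.
cpu_mode_query() {
    local mode=$1 window=${2:-1m}
    printf '(sum(increase(node_cpu_seconds_total{mode="%s"}[%s])) by (instance) / sum(increase(node_cpu_seconds_total[%s])) by (instance)) * 100\n' \
        "$mode" "$window" "$window"
}

# Run it as an instant query; --data-urlencode handles the braces and quotes:
# curl -sG http://10.0.0.31:9090/api/v1/query \
#      --data-urlencode "query=$(cpu_mode_query iowait)"
```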

## 9 Grafana with MySQL as its data store, adding Prometheus

```sh
1 See the official download page
https://grafana.com/grafana/download/9.5.21?pg=graf&plcmt=deploy-box-1

2 Install the dependencies and Grafana
sudo apt-get install -y adduser libfontconfig1 musl
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.5.21_amd64.deb
sudo dpkg -i grafana-enterprise_9.5.21_amd64.deb

3 Deploy MySQL with Docker
docker load < mysql-v8.0.36-oracle.tar.gz
fc037c17567d: Loading layer 118.8MB/118.8MB
152c1ecea280: Loading layer 11.26kB/11.26kB
fb5c92e924ab: Loading layer 2.359MB/2.359MB
5b76076a2dd4: Loading layer 13.86MB/13.86MB
a6909c467615: Loading layer 6.656kB/6.656kB
eaa1e85de732: Loading layer 3.072kB/3.072kB
9513d2aedd12: Loading layer 185.6MB/185.6MB
84d659420bad: Loading layer 3.072kB/3.072kB
876b8cd855eb: Loading layer 298.7MB/298.7MB
1c0ff7ed67c4: Loading layer 16.9kB/16.9kB
318dde184d61: Loading layer 1.536kB/1.536kB
Loaded image: mysql:8.0.36-oracle

4 Start MySQL
docker run -d --name mysql-server --restart always --network host \
  -e MYSQL_ALLOW_EMPTY_PASSWORD=yes \
  -e MYSQL_DATABASE=prometheus \
  -e MYSQL_USER=ysl \
  -e MYSQL_PASSWORD=123456 \
  mysql:8.0.36-oracle \
  --character-set-server=utf8 --collation-server=utf8_bin \
  --default-authentication-plugin=mysql_native_password

5 Edit the Grafana configuration file (these keys belong to the [database] section)
vim /etc/grafana/grafana.ini
...
type = mysql
host = 10.0.0.43:3306
name = prometheus
user = ysl
password = 123456

6 Start Grafana
systemctl restart grafana-server.service
ss -ntl | grep 3000
LISTEN 0 4096 *:3000 *:*

7 Log in to the Grafana web UI
http://10.0.0.31:3000/
```

![image-20250320224003443](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320224003443.png)

## 10 Using Grafana

### 1 Add Prometheus

![image-20250320224404386](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320224404386.png)
![image-20250320224430378](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320224430378.png)
![image-20250320224528821](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320224528821.png)
![image-20250320224600031](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320224600031.png)

### 2 Add a dashboard

![image-20250320224723160](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320224723160.png)
![image-20250320224742808](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250320224742808.png)
![image-20250321103350589](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321103350589.png)
![image-20250321104835583](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321104835583.png)
![image-20250321104418308](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321104418308.png)

### 3 Add more panels to the dashboard

![image-20250321104513530](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321104513530.png)

### 4 Save the dashboard

![image-20250321105619856](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321105619856.png)
![image-20250321105734581](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321105734581.png)
![image-20250321105751437](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321105751437.png)

### 5 Reopen Grafana and check the saved dashboard

![image-20250321111736157](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321111736157.png)

### 6 Add a row

Panels placed under a row can be collapsed.

![image-20250321111948631](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321111948631.png)
![image-20250321112023191](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321112023191.png)
![image-20250321112154431](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321112154431.png)

### 7 Defining a Grafana table

#### 1 Add a visualization

![image-20250321112628589](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321112628589.png)

#### 2 Query server information

`node_boot_time_seconds`

`avg(node_uname_info) by (instance,nodename,release)`

![image-20250321112948035](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321112948035.png)
![image-20250321113020884](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321113020884.png)
![image-20250321114136180](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321114136180.png)

```sh
Open Transform and find "Filter by name". The fields listed there correspond to the column headers in the screenshot. The default headers are not reader-friendly and can be renamed.
```

![image-20250321115835568](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321115835568.png)
![image-20250321115933440](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321115933440.png)
![image-20250321120009678](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321120009678.png)
![image-20250321120149993](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321120149993.png)
![image-20250321120328044](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321120328044.png)
![image-20250321120431497](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321120431497.png)

```sh
The instance column header now shows the new title. Repeat the same steps to add and rename the remaining columns.
```

![image-20250321120748972](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321120748972.png)

#### 3 Duplicate the table

![image-20250321121205882](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321121205882.png)

#### 4 Merge

![image-20250321121617681](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321121617681.png)
![image-20250321121704128](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321121704128.png)
![image-20250321121909641](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321121909641.png)

#### 5 Rename the panel

![image-20250321122002201](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321122002201.png)
![image-20250321122157294](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321122157294.png)
![image-20250321122325536](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321122325536.png)

```sh
Reference:
https://www.cnblogs.com/yinzhengjie/p/18538430
```

Add the remaining columns the same way.

CPU usage: `(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance)) * 100`

Total memory: `node_memory_MemTotal_bytes-0`

When adding the memory column, note the settings shown in the following screenshots:

![image-20250321124720154](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321124720154.png)
![image-20250321125049833](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321125049833.png)
![image-20250321125233877](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321125233877.png)

Memory usage: `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))*100`

Outbound NIC traffic: `max(rate(node_network_transmit_bytes_total[1m])) by (instance)`

![image-20250321130045485](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321130045485.png)

#### 6 Configure thresholds and colors

![image-20250321130354715](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321130354715.png)
![image-20250321130436197](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321130436197.png)
![image-20250321130525137](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321130525137.png)

```sh
As a test, run a stress load on node-exporter41 and check whether the cell turns red.
stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 1m
```

![image-20250321130853822](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321130853822.png)

## 8 Grafana custom variables

![image-20250321145120549](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321145120549.png)
![image-20250321145143760](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321145143760.png)
![image-20250321145838991](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321145838991.png)
![image-20250321145931607](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321145931607.png)

Set the metric to `{job="node-exporter"}`. The job name `node-exporter` comes from the Prometheus configuration.

![image-20250321150452714](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321150452714.png)
![image-20250321150402838](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321150402838.png)
![image-20250321151009769](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321151009769.png)

Edit the dashboard configured earlier and use the variable in the existing panel queries.

![image-20250321151500522](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321151500522.png)

```sh
Percentage of CPU time in I/O wait over 1 minute
(sum(increase(node_cpu_seconds_total{mode='iowait',instance="$host"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
```

![image-20250321151547891](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321151547891.png)

```sh
CPU usage over 1 minute
(1 - sum(increase(node_cpu_seconds_total{mode='idle',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
```

![image-20250321151656996](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321151656996.png)

```sh
Percentage of CPU time in I/O wait over 1 minute
(sum(increase(node_cpu_seconds_total{mode='iowait',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
```

![image-20250321151958353](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321151958353.png)

```sh
Percentage of CPU time in kernel (system) mode over 1 minute
(sum(increase(node_cpu_seconds_total{mode='system',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
```

![image-20250321152050210](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321152050210.png)

Finally, select a node and view the result.
![image-20250321152202491](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321152202491.png) ## 9 grafana dashboard备份和恢复 ### 1 备份 #### 1 下载json备份 ![image-20250321153008234](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153008234.png) ![image-20250321153042987](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153042987.png) #### 2 import 备份 ![image-20250321153150346](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153150346.png) ![image-20250321153225170](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153225170.png) ### 2 恢复 #### 1 删除 ![image-20250321153504812](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153504812.png) ![image-20250321153544852](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153544852.png) #### 2 恢复 ![image-20250321153634481](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153634481.png) ![image-20250321153704739](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153704739.png) ![image-20250321153742565](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153742565.png) ![image-20250321153807331](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321153807331.png) ```sh { "annotations": { "list": [ { "builtIn": 1, "datasource": { "type": "grafana", "uid": "-- Grafana --" }, "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": 3, "links": [], "liveNow": false, "panels": [ { "collapsed": true, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 8, "panels": [ { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "fieldConfig": { "defaults": { 
"color": { "mode": "thresholds" }, "custom": { "align": "auto", "cellOptions": { "type": "auto" }, "inspect": false }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green" }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "instance" }, "properties": [ { "id": "displayName", "value": "服务器IP" } ] }, { "matcher": { "id": "byName", "options": "Time" }, "properties": [ { "id": "displayName", "value": "开机时间" } ] }, { "matcher": { "id": "byName", "options": "nodename" }, "properties": [ { "id": "displayName", "value": "主机名" } ] }, { "matcher": { "id": "byName", "options": "release" }, "properties": [ { "id": "displayName", "value": "内核版本" } ] }, { "matcher": { "id": "byName", "options": "Value #load" }, "properties": [ { "id": "displayName", "value": "负载" } ] }, { "matcher": { "id": "byName", "options": "Value #core" }, "properties": [ { "id": "displayName", "value": "CPU使用率" }, { "id": "custom.cellOptions", "value": { "type": "color-background" } } ] }, { "matcher": { "id": "byName", "options": "Value #Memory" }, "properties": [ { "id": "unit", "value": "bytes" }, { "id": "displayName", "value": "内存总量" } ] }, { "matcher": { "id": "byName", "options": "Value #Memory Used" }, "properties": [ { "id": "displayName", "value": "内存使用率" } ] }, { "matcher": { "id": "byName", "options": "Value #NetWork" }, "properties": [ { "id": "displayName", "value": "网卡流量" }, { "id": "unit", "value": "bytes" } ] } ] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 1 }, "id": 7, "options": { "cellHeight": "sm", "footer": { "countRows": false, "fields": "", "reducer": [ "sum" ], "show": false }, "showHeader": true }, "pluginVersion": "9.5.21", "targets": [ { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "exemplar": false, "expr": "avg(node_uname_info) by (instance,nodename,release)", "format": "table", "instant": true, "legendFormat": "__auto", "range": false, 
"refId": "kernel" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "exemplar": false, "expr": "node_load5-0", "format": "table", "hide": false, "instant": true, "legendFormat": "__auto", "range": false, "refId": "load" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "exemplar": false, "expr": "(1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[1m])) by (instance)) * 100", "format": "table", "hide": false, "instant": true, "legendFormat": "__auto", "range": false, "refId": "core" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "exemplar": false, "expr": "node_memory_MemTotal_bytes-0", "format": "table", "hide": false, "instant": true, "legendFormat": "__auto", "range": false, "refId": "Memory" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "exemplar": false, "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))*100", "format": "table", "hide": false, "instant": true, "legendFormat": "__auto", "range": false, "refId": "Memory Used" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "exemplar": false, "expr": "max(rate(node_network_transmit_bytes_total[1m])) by (instance)", "format": "table", "hide": false, "instant": true, "legendFormat": "__auto", "range": false, "refId": "NetWork" } ], "title": "服务器集群概览", "transformations": [ { "id": "filterFieldsByName", "options": { "include": { "names": [ "Time", "instance", "nodename", "release", "Value #load", "Value #core", "Value #Memory", "Value #Memory Used", "Value #NetWork" ] } } }, { "id": "merge", "options": {} } ], "transparent": true, "type": "table" } ], "title": "Overview", "type": "row" }, { "collapsed": false, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 1 }, "id": 6, 
"panels": [], "title": "CPU监控", "type": "row" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "__systemRef": "hideSeriesFrom", "matcher": { "id": "byNames", "options": { "mode": "exclude", "names": [ "10.0.0.41:9100" ], "prefix": "All except:", "readOnly": true } }, "properties": [ { "id": "custom.hideFrom", "value": { "legend": false, "tooltip": false, "viz": true } } ] } ] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 2 }, "id": 5, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "expr": "(sum(increase(node_cpu_seconds_total{mode='iowait',instance=\"$host\"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100\r\n", "legendFormat": "__auto", "range": true, "refId": "A" } ], "title": "计算CPU IO等待时间的1分钟内百分比", "transparent": true, "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, 
"axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 2 }, "id": 1, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "expr": "(1 - sum(increase(node_cpu_seconds_total{mode='idle',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100\r\n", "legendFormat": "__auto", "range": true, "refId": "A" } ], "title": "统计一分钟内cpu的使用率", "transparent": true, "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { 
"color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 10 }, "id": 4, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "expr": "(sum(increase(node_cpu_seconds_total{mode='iowait',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100\r\n", "legendFormat": "__auto", "range": true, "refId": "A" } ], "title": "计算CPU IO等待时间的1分钟内百分比", "transparent": true, "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 10 }, "id": 3, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "editorMode": "code", "expr": "(sum(increase(node_cpu_seconds_total{mode='system',instance='$host'}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by 
(instance)) * 100\r\n", "legendFormat": "__auto", "range": true, "refId": "A" } ], "title": "计算CPU内核态的1分钟内百分比", "transparent": true, "type": "timeseries" } ], "refresh": "", "schemaVersion": 38, "style": "dark", "tags": [], "templating": { "list": [ { "current": { "selected": false, "text": "10.0.0.41:9100", "value": "10.0.0.41:9100" }, "datasource": { "type": "prometheus", "uid": "c2807b1d-1750-4ac6-905d-29467a8acb1f" }, "definition": "label_values({job=\"node-exporter\"},instance)", "description": "需要查看的具体主机的IP地址", "hide": 0, "includeAll": false, "label": "选择查询节点", "multi": false, "name": "host", "options": [], "query": { "query": "label_values({job=\"node-exporter\"},instance)", "refId": "PrometheusVariableQueryEditor-VariableQuery" }, "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 0, "type": "query" } ] }, "time": { "from": "2025-03-21T02:43:07.257Z", "to": "2025-03-21T02:46:44.145Z" }, "timepicker": {}, "timezone": "", "title": "自定义dashboard-cpu", "uid": "f0b29f48-f63b-422f-a9b7-85702ad5a6c0", "version": 4, "weekStart": "" } ``` ## 10 导入第三方Node Exporter ![image-20250321154241515](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321154241515.png) ![image-20250321154425025](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321154425025.png) 去官网查自己想要的模版,记录id号 ![image-20250321154735800](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321154735800.png) ![image-20250321154837132](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321154837132.png) ![image-20250321154901830](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321154901830.png) ![image-20250321154915855](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321154915855.png) 可以每个点进去查看PromQL,然后借鉴 ## 11 node exporter ```sh node exporter整个流程 1 主要作用,采集linux的一些指标,内存,磁盘,网络等,通过http端口9100暴露出去 2 prometheus采集node 
exporter指标保存到本地,prometheus配置了配置文件,指向了node exporter 3 grafana配置数据源指向prometheus,展示数据 4 grafana 定义变量、dashboard、行信息等 5 从官网导入模版,修改 ``` ## 12 prometheus监控 ### 1 监控windows ```sh 1 官网下载安装包 2 windows运行 ``` ![image-20250321162603689](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321162603689.png) ```sh 3 修改prometheus配置文件 vim /software/prometheus-2.53.3.linux-amd64/prometheus.yml ... - job_name: "windows-exporter" static_configs: - targets: ["192.168.137.254:9182"] 4 热加载prometheus curl -X POST 10.0.0.31:9090/-/reload 5 prometheus 查看 6 官网下载windows模版 20763 ``` ![image-20250321163459795](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321163459795.png) ![image-20250321163754243](C:\Users\Administrator\AppData\Roaming\Typora\typora-user-images\image-20250321163754243.png) ### 2 监控zookeeper集群 1 zookeeper启用metrc指标 ```sh zookeeper每个阶段执行以下命令 cat >>/software/zookeeper/conf/zoo.cfg<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml< create user ‘exporter’@’%’ identified by ‘123456’ with max_user_connections 3;
Query OK, 0 rows affected (0.08 sec)

mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
Query OK, 0 rows affected (0.01 sec)
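
# Sketch (an addition, not part of the original session): the two statements
# above written with plain ASCII quotes, which is what the mysql client expects.
# exporter_sql is a hypothetical helper name; the user, host pattern, and
# password are the values used in these notes.
exporter_sql() {
  local user="$1" host="$2" password="$3"
  printf "CREATE USER '%s'@'%s' IDENTIFIED BY '%s' WITH MAX_USER_CONNECTIONS 3;\n" "$user" "$host" "$password"
  printf "GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO '%s'@'%s';\n" "$user" "$host"
}
# Example (assumes a local root login): exporter_sql exporter '%' 123456 | mysql -uroot -p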

5 Start mysql_exporter
root@node-exporter43:~# cat >>/etc/.my.cnf<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/sd/file-sd-yaml.yaml <>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml <>/software/prometheus-2.53.3.linux-amd64/sd/file-sd-yaml-node_exporter.yaml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml <>/software/prometheus-2.53.3.linux-amd64/prometheus.yml< /etc/systemd/system/pushgatway.service <>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/alertmanager-0.27.0.linux-amd64/alertmanager.yml<>/software/prometheus-2.53.3.linux-amd64/prometheus.yml<>/software/prometheus-2.53.3.linux-amd64/email-rules.yml< }}
- Reference to the metric sample value: {{ $value }}

To control how the alert email is displayed, some familiarity with HTML is needed. Reference:
https://www.w3school.com.cn/html/index.asp

2 Example: custom alert template on the Alertmanager node
2.1 Custom email template
[root@prometheus-server32]# cat >/software/alertmanager-0.27.0.linux-amd64/email.tmpl<alertmanager_test/

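<table>
<tr><th>Alert</th><th>Instance</th><th>Threshold</th><th>Start time</th></tr>
{{ range \$i, \$alert := .Alerts }}
<tr><td>{{ index \$alert.Labels "alertname" }}</td><td>{{ index \$alert.Labels "instance" }}</td><td>{{ index \$alert.Annotations "value" }}</td><td>{{ \$alert.StartsAt }}</td></tr>
{{ end }}
</table>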

{{ end }}
EOF

2.2 Reference the custom template file from Alertmanager
[root@prometheus-server32 alertmanager-0.27.0.linux-amd64]# cat >>/software/alertmanager-0.27.0.linux-amd64/alertmanager.yml <>/software/alertmanager-0.27.0.linux-amd64/alertmanager.yml< /etc/systemd/system/victoria-metrics.service <>/software/prometheus-2.53.3.linux-amd64/prometheus.yml< etcd-ca-csr.json < ca-config.json < etcd-csr.json < /software/etcd/etcd.config.yml <<'EOF' name: 'node-exporter41' data-dir: /var/lib/etcd wal-dir: /var/lib/etcd/wal snapshot-count: 5000 heartbeat-interval: 100 election-timeout: 1000 quota-backend-bytes: 0 listen-peer-urls: 'https://10.0.0.41:2380' listen-client-urls: 'https://10.0.0.41:2379,http://127.0.0.1:2379' max-snapshots: 3 max-wals: 5 cors: initial-advertise-peer-urls: 'https://10.0.0.41:2380' advertise-client-urls: 'https://10.0.0.41:2379' discovery: discovery-fallback: 'proxy' discovery-proxy: discovery-srv: initial-cluster: 'node-exporter41=https://10.0.0.41:2380,node-exporter42=https://10.0.0.42:2380,node-exporter43=https://10.0.0.43:2380' initial-cluster-token: 'etcd-k8s-cluster' initial-cluster-state: 'new' strict-reconfig-check: false enable-v2: true enable-pprof: true proxy: 'off' proxy-failure-wait: 5000 proxy-refresh-interval: 30000 proxy-dial-timeout: 1000 proxy-write-timeout: 5000 proxy-read-timeout: 0 client-transport-security: cert-file: '/software/certs/etcd/etcd-server.pem' key-file: '/software/certs/etcd/etcd-server-key.pem' client-cert-auth: true trusted-ca-file: '/software/certs/etcd/etcd-ca.pem' auto-tls: true peer-transport-security: cert-file: '/software/certs/etcd/etcd-server.pem' key-file: '/software/certs/etcd/etcd-server-key.pem' peer-client-cert-auth: true trusted-ca-file: '/software/certs/etcd/etcd-ca.pem' auto-tls: true debug: false log-package-levels: log-outputs: [default] force-new-cluster: false EOF 5.2 node-exporter42节点的配置文件 root@node-exporter42:~# mkdir -pv /software/etcd root@node-exporter42:~# cat > /software/etcd/etcd.config.yml < /software/etcd/etcd.config.yml < 
/usr/lib/systemd/system/etcd.service <<'EOF' [Unit] Description=Etcd Service Documentation=https://coreos.com/etcd/docs/latest/ After=network.target [Service] Type=notify ExecStart=/usr/local/bin/etcd --config-file=/software/etcd/etcd.config.yml Restart=on-failure RestartSec=10 LimitNOFILE=65536 [Install] WantedBy=multi-user.target Alias=etcd3.service EOF 7.启动etcd集群 systemctl daemon-reload && systemctl enable --now etcd systemctl status etcd 8 查看集群状态 root@node-exporter43:~# etcdctl --endpoints="10.0.0.41:2379,10.0.0.42:2379,10.0.0.43:2379" --cacert=/software/certs/etcd/etcd-ca.pem --cert=/software/certs/etcd/etcd-server.pem --key=/software/certs/etcd/etcd-server-key.pem endpoint status --write-out=table +----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | +----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | 10.0.0.41:2379 | 9378902f41df91e9 | 3.5.17 | 20 kB | true | false | 2 | 9 | 9 | | | 10.0.0.42:2379 | 18f972748ec1bd96 | 3.5.17 | 25 kB | false | false | 2 | 9 | 9 | | | 10.0.0.43:2379 | a3dfd2d37c461ee9 | 3.5.17 | 20 kB | false | false | 2 | 9 | 9 | | +----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ 9.验证etcd高可用集群 9.1 停止leader节点 root@node-exporter41:~# systemctl stop etcd root@node-exporter41:~# etcdctl --endpoints="10.0.0.41:2379,10.0.0.42:2379,10.0.0.43:2379" --cacert=/software/certs/etcd/etcd-ca.pem --cert=/software/certs/etcd/etcd-server.pem --key=/software/certs/etcd/etcd-server-key.pem endpoint status --write-out=table {"level":"warn","ts":"2025-03-26T17:22:59.961827+0800","logger":"etcd-client","caller":"v3@v3.5.17/retry_interceptor.go:63","msg":"retrying of unary invoker 
failed","target":"etcd-endpoints://0xc0002845a0/10.0.0.41:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.0.41:2379: connect: connection refused\""} Failed to get the status of endpoint 10.0.0.41:2379 (context deadline exceeded) +----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | +----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | 10.0.0.42:2379 | 18f972748ec1bd96 | 3.5.17 | 25 kB | true | false | 3 | 10 | 10 | | | 10.0.0.43:2379 | a3dfd2d37c461ee9 | 3.5.17 | 20 kB | false | false | 3 | 10 | 10 | | +----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ 10 etcd的基本使用 1 配置别名 每个节点执行 root@node-exporter43:~# tail -1 .bashrc alias etcdctl='etcdctl --endpoints="10.0.0.41:2379,10.0.0.42:2379,10.0.0.43:2379" --cacert=/software/certs/etcd/etcd-ca.pem --cert=/software/certs/etcd/etcd-server.pem --key=/software/certs/etcd/etcd-server-key.pem' root@node-exporter43:~#etcdctl endpoint status --write-out=table 2 查看集群状态 2.1 写入数据KEY的school,value等于xiaoxue [root@node-exporter42 ~]# etcdctl put school xiaoxue OK [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl put scheduler 调度器 OK [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl put class Linux OK [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl put service 服务 OK [root@node-exporter42 ~]# 2.2 查看数据 [root@node-exporter42 ~]# etcdctl get school school xiaoxue [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl get school --keys-only school [root@node-exporter42 ~]# [root@node-exporter42 ~]# 
etcdctl get school --print-value-only xiaoxue [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl get sch --prefix --keys-only # 匹配"sch"开头的key scheduler school [root@node-exporter42 ~]# [root@node-exporter42 ~]# 2.3 修改数据 [root@node-exporter42 ~]# etcdctl get school --print-value-only xiaoxue [root@node-exporter42 ~]# [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl put school laonanhai # 如果key的值是存在,则直接覆盖 OK [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl get school --print-value-only laonanhai [root@node-exporter42 ~]# 2.4 删除数据 [root@node-exporter42 ~]# etcdctl get sch --prefix --keys-only scheduler school [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl del school 1 [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl get sch --prefix --keys-only scheduler [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl del sch --prefix 1 [root@node-exporter42 ~]# [root@node-exporter42 ~]# etcdctl get sch --prefix --keys-only ``` ## 24 etcd数据备份恢复 ```sh 1 拷贝程序 root@node-exporter43:~# scp /usr/local/bin/etcd* 10.0.0.32:/usr/local/bin/ 2.准备证书文件 2.1 安装cfssl证书管理工具 root@prometheus-server32:~# unzip cfssl-v1.6.5.zip root@prometheus-server32:~# rename -v "s/_1.6.5_linux_amd64//g" cfssl* root@prometheus-server32:~# mv cfssl* /usr/local/bin/ root@prometheus-server32:~# chmod +x /usr/local/bin/cfssl* 3 创建证书存储目录 root@prometheus-server32:~# mkdir -pv /software/certs/etcd && cd /software/certs/etcd/ 4 生成证书的CSR文件: 证书签发请求文件,配置了一些域名,公司,单位 root@prometheus-server32:/software/certs/etcd# cat > etcd-ca-csr.json < ca-config.json < etcd-csr.json < /software/etcd/etcd.config.yml <<'EOF' name: 'prometheus-server32' data-dir: /var/lib/etcd wal-dir: /var/lib/etcd/wal snapshot-count: 5000 heartbeat-interval: 100 election-timeout: 1000 quota-backend-bytes: 0 listen-peer-urls: 'https://10.0.0.32:2380' listen-client-urls: 'https://10.0.0.32:2379,http://127.0.0.1:2379' max-snapshots: 3 max-wals: 5 cors: initial-advertise-peer-urls: 
'https://10.0.0.32:2380' advertise-client-urls: 'https://10.0.0.32:2379' discovery: discovery-fallback: 'proxy' discovery-proxy: discovery-srv: initial-cluster: 'prometheus-server32=https://10.0.0.32:2380' initial-cluster-token: 'etcd-k8s-cluster' initial-cluster-state: 'new' strict-reconfig-check: false enable-v2: true enable-pprof: true proxy: 'off' proxy-failure-wait: 5000 proxy-refresh-interval: 30000 proxy-dial-timeout: 1000 proxy-write-timeout: 5000 proxy-read-timeout: 0 client-transport-security: cert-file: '/software/certs/etcd/etcd-server.pem' key-file: '/software/certs/etcd/etcd-server-key.pem' client-cert-auth: true trusted-ca-file: '/software/certs/etcd/etcd-ca.pem' auto-tls: true peer-transport-security: cert-file: '/software/certs/etcd/etcd-server.pem' key-file: '/software/certs/etcd/etcd-server-key.pem' peer-client-cert-auth: true trusted-ca-file: '/software/certs/etcd/etcd-ca.pem' auto-tls: true debug: false log-package-levels: log-outputs: [default] force-new-cluster: false EOF 10 准备启动脚本 root@prometheus-server32:/software/certs/etcd#cat > /usr/lib/systemd/system/etcd.service <<'EOF' [Unit] Description=Etcd Service Documentation=https://coreos.com/etcd/docs/latest/ After=network.target [Service] Type=notify ExecStart=/usr/local/bin/etcd --config-file=/software/etcd/etcd.config.yml Restart=on-failure RestartSec=10 LimitNOFILE=65536 [Install] WantedBy=multi-user.target Alias=etcd3.service EOF 11 启动etcd集群 systemctl daemon-reload && systemctl enable --now etcd systemctl status etcd 12 添加别名 root@prometheus-server32:/software/certs/etcd# tail -1 /root/.bashrc alias etcdctl='etcdctl --endpoints="10.0.0.32:2379" --cacert=/software/certs/etcd/etcd-ca.pem --cert=/software/certs/etcd/etcd-server.pem --key=/software/certs/etcd/etcd-server-key.pem' root@prometheus-server32:/software/certs/etcd# source /root/.bashrc 13 查看etcd状态 root@prometheus-server32:/software/certs/etcd# etcdctl endpoint status --write-out=table 
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | +----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | 10.0.0.32:2379 | b58958a430a55e35 | 3.5.17 | 25 kB | true | false | 2 | 4 | 4 | | +----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ 14 43节点创建快照 root@node-exporter43:~# \etcdctl snapshot save /tmp/etcd.backup 15 将快照文件发送到32节点 root@node-exporter43:~# scp /tmp/etcd.backup root@10.0.0.32:~ 16 32节点停止etcd服务 root@prometheus-server32:~# systemctl stop etcd 17 32节点备份etcd的源数据目录 root@prometheus-server32:~# mv /var/lib/etcd/ /var/lib/etcd-bak 18 32节点恢复数据【恢复的数据目录必须为空】 root@prometheus-server32:~# etcdctl snapshot restore etcd.backup --data-dir=/var/lib/etcd/ root@prometheus-server32:~# ll /var/lib/etcd total 12 drwx------ 3 root root 4096 Mar 26 21:57 ./ drwxr-xr-x 43 root root 4096 Mar 26 21:57 ../ drwx------ 4 root root 4096 Mar 26 21:57 member/ 19 32节点启动etcd服务 root@prometheus-server32:~# systemctl start etcd root@prometheus-server32:~# systemctl status etcd ● etcd.service - Etcd Service Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled) Active: active (running) since Wed 2025-03-26 21:59:24 CST; 4s ago .... 20 验证数据是否恢复成功 root@prometheus-server32:~# etcdctl get service service 服务 ```
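
Steps 14 to 20 above can be collected into two small functions, one for the source node and one for the restore node. This is a minimal sketch, assuming the same paths and hosts as in the notes (`/tmp/etcd.backup`, `/var/lib/etcd`, `10.0.0.32`). The `run` helper and the `DRY_RUN`, `SNAPSHOT`, `DATA_DIR`, and `TARGET` variables are hypothetical additions for illustration, not part of etcd. With `DRY_RUN=1` (the default here) the functions only print the commands they would execute.

```sh
#!/usr/bin/env bash
# Sketch of the snapshot backup/restore steps; nothing runs unless DRY_RUN=0.
set -euo pipefail

SNAPSHOT="${SNAPSHOT:-/tmp/etcd.backup}"   # snapshot path on the source node
DATA_DIR="${DATA_DIR:-/var/lib/etcd}"      # etcd data directory on the restore node
TARGET="${TARGET:-root@10.0.0.32}"         # node that receives the snapshot
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print commands only (illustrative switch)

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 14-15: take a snapshot on the source node and copy it to the target's home directory.
backup_on_source() {
  run etcdctl snapshot save "$SNAPSHOT"
  run scp "$SNAPSHOT" "$TARGET:~"
}

# Steps 16-19: stop etcd, move the old data directory aside, restore, start etcd.
# Run this from the directory that holds the copied snapshot file.
restore_on_target() {
  local file
  file="$(basename "$SNAPSHOT")"
  run systemctl stop etcd
  run mv "$DATA_DIR" "${DATA_DIR}-bak"
  run etcdctl snapshot restore "$file" --data-dir="$DATA_DIR"
  run systemctl start etcd
}
```

Call `backup_on_source` on node-exporter43 and `restore_on_target` on prometheus-server32. Shell aliases are not expanded inside scripts, so the `etcdctl` calls here run without the TLS flags from the `.bashrc` alias, which matches the `\etcdctl` call in step 14. Moving the old data directory aside rather than deleting it keeps a fallback if the restored data turns out to be wrong.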
