Building K8s Monitoring with Nightingale (n9e): Implementation
Install n9e & categraf
Install n9e
# Download n9e
wget https://download.flashcat.cloud/n9e-v7.6.0-linux-amd64.tar.gz
mkdir -p /opt/n9e/server
tar zxvf n9e-v7.6.0-linux-amd64.tar.gz -C /opt/n9e/server
# Download categraf
wget https://github.com/flashcatcloud/categraf/releases/download/v0.3.82/categraf-v0.3.82-linux-amd64.tar.gz
tar zxvf categraf-v0.3.82-linux-amd64.tar.gz -C /opt/n9e/
mv /opt/n9e/categraf-v0.3.82-linux-amd64 /opt/n9e/categraf
# Register categraf with systemd
cd /opt/n9e/categraf
./categraf --install
systemctl enable categraf
systemctl start categraf
# Install MariaDB
yum -y install mariadb*
# Start the MariaDB service
systemctl start mariadb.service
# Enable MariaDB to start on boot
systemctl enable mariadb.service
# Set the MySQL root password
mysql -e "SET PASSWORD FOR 'root'@'localhost' = PASSWORD('1234');"
# Import the n9e initialization SQL script
mysql -uroot -p1234 < /opt/n9e/server/n9e.sql
# Install Redis
yum install epel-release
yum install -y redis
# Set requirepass
vi /etc/redis.conf
systemctl start redis
systemctl enable redis
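In /etc/redis.conf the line to set is simply the following (the password here is only an example and must match whatever you later configure for n9e's Redis connection):
requirepass 1234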
# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.53.2/prometheus-2.53.2.linux-amd64.tar.gz
tar zxvf prometheus-2.53.2.linux-amd64.tar.gz -C /opt/n9e
mv /opt/n9e/prometheus-2.53.2.linux-amd64 /opt/n9e/prometheus
Modify the n9e configuration file
It is recommended to enable APIForAgent and APIForService and change their passwords.
cd /opt/n9e/server
vim etc/config.toml
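The relevant sections look roughly like this (section and key names follow the shipped etc/config.toml; the users and passwords are placeholders, so verify against your n9e version):
[HTTP.APIForAgent]
Enable = true

[HTTP.APIForAgent.BasicAuth]
user001 = "change-me"

[HTTP.APIForService]
Enable = true

[HTTP.APIForService.BasicAuth]
user002 = "change-me"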
Configure Prometheus basic auth
For reference, create web.yml in the prometheus directory:
basic_auth_users:
admin: $2b$12$cxsGcJV75SE.eIzN9M1m.OYqd18leOAANtfjpM7SxIzeud7Jo415K
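One way to generate the bcrypt hash above (htpasswd comes from the httpd-tools package; the command prompts for the password and prints the hash):
yum install -y httpd-tools
htpasswd -nBC 12 "" | tr -d ':\n'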
Configure systemd units
# prometheus
cat <<EOF >/etc/systemd/system/prometheus.service
[Unit]
Description="prometheus"
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
ExecStart=/opt/n9e/prometheus/prometheus --config.file=/opt/n9e/prometheus/prometheus.yml --storage.tsdb.path=/opt/n9e/prometheus/data --web.enable-lifecycle --enable-feature=remote-write-receiver --query.lookback-delta=2m --web.config.file=/opt/n9e/prometheus/web.yml --web.enable-admin-api
Restart=on-failure
RestartSec=5s
SuccessExitStatus=0
LimitNOFILE=65536
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=prometheus
[Install]
WantedBy=multi-user.target
EOF
# n9e
cat <<EOF >/etc/systemd/system/n9e.service
[Unit]
Description="n9e"
After=network.target
[Service]
Type=simple
ExecStart=/opt/n9e/server/n9e -configs /opt/n9e/server/etc/
Restart=on-failure
SuccessExitStatus=0
LimitNOFILE=65536
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=n9e
[Install]
WantedBy=multi-user.target
EOF
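Reload systemd so the two new unit files are picked up:
systemctl daemon-reload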
Start the services
systemctl start prometheus
systemctl enable prometheus
systemctl start n9e
systemctl enable n9e
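A quick sanity check (the Prometheus credentials are whatever you put in web.yml; 17000 is n9e's default web port):
# Prometheus behind basic auth
curl -u admin:<password> http://127.0.0.1:9090/-/healthy
# n9e web UI
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:17000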
Node categraf configuration
Edit conf/config.toml under the categraf directory: set hostname, add an entry under global.labels, point the writers url and heartbeat url at the n9e server address, and configure basic auth.
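A sketch of the relevant parts of conf/config.toml (field names follow categraf's shipped config; the hostname, label, server address, and credentials below are placeholders):
[global]
hostname = "node-01"
[global.labels]
region = "idc-a"

[[writers]]
url = "http://<n9e-server>:17000/prometheus/v1/write"
basic_auth_user = "user001"
basic_auth_pass = "<APIForAgent password>"

[heartbeat]
enable = true
url = "http://<n9e-server>:17000/v1/n9e/heartbeat"
basic_auth_user = "user001"
basic_auth_pass = "<APIForAgent password>"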
# Register with systemd (run from the node's categraf directory)
./categraf --install
systemctl start categraf
Configure nginx vts monitoring
Download nginx-module-vts and extract it, re-run nginx's configure with the extra flag --add-module=/opt/nginx-module-vts, then run make and make install as usual. Afterwards, update the nginx configuration, restart nginx, and visit the /status path; in production it is recommended to restrict access with an allowlist.
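A minimal build sketch (the nginx source path and existing configure flags are placeholders for your environment):
cd /path/to/nginx-source
./configure --add-module=/opt/nginx-module-vts   # plus the configure flags you already use
make && make install
The nginx configuration that exposes the status page then looks like this: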
http {
vhost_traffic_status_zone;
server {
location /status {
allow 192.168.200.146;
deny all;
vhost_traffic_status_display;
vhost_traffic_status_display_format html;
}
}
}
Configure the categraf input
Edit conf/input.prometheus/prometheus.toml under the categraf directory, add the nginx status URL https://xxxx.cn/status/format/prometheus to the instances urls, then restart categraf.
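A minimal sketch of that instance (only the urls field matters here; other options keep their defaults):
[[instances]]
urls = ["https://xxxx.cn/status/format/prometheus"]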
Configure k8s monitoring
My cluster is v1.20.15. Here I deploy Prometheus from the Rancher catalog; the images may fail to pull, so I first pushed them to Harbor and then changed the image references in the Helm template.
- Add the Rancher chart source https://git.rancher.io/charts to the catalog, or use the helm CLI to pull the prometheus chart locally, edit values.yaml, and then deploy.
helm repo add rancher-charts https://git.rancher.io/charts
helm repo update
helm search repo rancher-charts/prometheus
helm pull rancher-charts/prometheus --untar
cd prometheus
# Edit values.yaml
# Install
helm install my-prometheus . -n monitor
- In the catalog, find prometheus and fill in the required fields. When the in-cluster Prometheus remote-writes directly, writes may fail because of clock/timezone differences; in that case, point the remote write URL at n9e's write endpoint (http://xxxx:17000/prometheus/v1/write?ignore_host=true). After the data is forwarded to n9e, n9e rewrites the timestamps to the server's time before writing to the TSDB. The ignore_host parameter strips the host label from the metrics, because a host label would make n9e treat the series as belonging to a machine. (A standalone snippet of this remote_write variant follows the ConfigMap below.)
prometheus-server.yml
apiVersion: v1
data:
alerts: '{}'
prometheus.yml: |-
global:
evaluation_interval: 1m
scrape_interval: 1m
scrape_timeout: 10s
rule_files:
- /etc/config/rules
- /etc/config/alerts
scrape_configs:
- job_name: prometheus
static_configs:
- targets:
- localhost:9090
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
job_name: kubernetes-apiservers
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: default;kubernetes;https
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
job_name: kubernetes-nodes
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$1/proxy/metrics
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
job_name: kubernetes-nodes-cadvisor
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- job_name: kubernetes-service-endpoints
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: kubernetes_node
- honor_labels: true
job_name: prometheus-pushgateway
kubernetes_sd_configs:
- role: service
relabel_configs:
- action: keep
regex: pushgateway
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_probe
- job_name: kubernetes-services
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module:
- http_2xx
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_probe
- source_labels:
- __address__
target_label: __param_target
- replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
target_label: __address__
- source_labels:
- __param_target
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
- action: replace
source_labels:
- __meta_kubernetes_pod_label_address_ip
- __meta_kubernetes_pod_label_address_port
- __meta_kubernetes_pod_label_address_service
target_label: address
regex: (.+);(.+);(.+)
replacement: $1:$2/$3
- job_name: 'kubernetes-ingresses'
metrics_path: /probe
params:
module: [http_2xx]
kubernetes_sd_configs:
- role: ingress
relabel_configs:
- source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__,__meta_kubernetes_ingress_path]
regex: (.+);(.+)
replacement: https://${1}${2}
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_ingress_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_ingress_name]
target_label: kubernetes_name
remote_write:
- url: "http://192.168.200.146:9090/api/v1/write"
basic_auth:
username: "admin"
password: "admin"
queue_config:
max_samples_per_send: 1000
batch_send_deadline: 5s
max_shards: 20
write_relabel_configs:
- source_labels: []
target_label: "source"
replacement: "k8s"
rules: '{}'
kind: ConfigMap
metadata:
name: prometheus-server
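If you instead point remote_write directly at n9e as described above, the block would look roughly like this (a sketch; the host and the APIForAgent credentials are placeholders):
remote_write:
  - url: "http://<n9e-server>:17000/prometheus/v1/write?ignore_host=true"
    basic_auth:
      username: "user001"
      password: "<APIForAgent password>"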
Configure Oracle monitoring
Use the image ghcr.io/iamseth/oracledb_exporter:0.6.0; a reference Deployment:
oracledb_exporter.yml
apiVersion: apps/v1
kind: Deployment
metadata:
name: oracledb-exporter
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
workload.user.cattle.io/workloadselector: deployment-prometheus-oracledb-exporter
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
template:
metadata:
annotations:
prometheus.io/path: /metrics
prometheus.io/port: "9161"
prometheus.io/scrape: "true"
labels:
address-ip: 192.168.200.130
address-port: "1521"
address-service: orcl
region: dataassets
workload.user.cattle.io/workloadselector: deployment-prometheus-oracledb-exporter
spec:
containers:
- env:
- name: CUSTOM_METRICS
value: /tmp/custom-metrics.toml
- name: DATA_SOURCE_NAME
value: oracle://system:xxxx@192.168.200.130:1521/orcl
image: oracledb_exporter:0.6.0
imagePullPolicy: IfNotPresent
name: oracledb-exporter
ports:
- containerPort: 9161
name: 9161tcp2
protocol: TCP
resources: {}
securityContext:
allowPrivilegeEscalation: false
privileged: false
readOnlyRootFilesystem: false
runAsNonRoot: false
stdin: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
tty: true
volumeMounts:
- mountPath: /tmp/custom-metrics.toml
name: custom-metrics
subPath: custom-metrics.toml
dnsPolicy: ClusterFirst
restartPolicy: Always
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 292
name: custom-metrics
optional: false
name: custom-metrics
custom-metrics.yml:
apiVersion: v1
data:
custom-metrics.toml: |-
[[metric]]
context = "slow_queries"
metricsdesc = { p95_time_usecs= "Gauge metric with percentile 95 of elapsed time.", p99_time_usecs= "Gauge metric with percentile 99 of elapsed time." }
request = "select percentile_disc(0.95) within group (order by elapsed_time) as p95_time_usecs, percentile_disc(0.99) within group (order by elapsed_time) as p99_time_usecs from v$sql where last_active_time >= sysdate - 5/(24*60)"
[[metric]]
context = "big_queries"
metricsdesc = { p95_rows= "Gauge metric with percentile 95 of returned rows.", p99_rows= "Gauge metric with percentile 99 of returned rows." }
request = "select percentile_disc(0.95) within group (order by rownum) as p95_rows, percentile_disc(0.99) within group (order by rownum) as p99_rows from v$sql where last_active_time >= sysdate - 5/(24*60)"
[[metric]]
context = "size_dba_segments_top100"
metricsdesc = {table_bytes="Gauge metric with the size of the tables in user segments."}
labels = ["segment_name"]
request = "select * from (select segment_name,sum(bytes) as table_bytes from dba_segments where segment_type='TABLE' group by segment_name) where ROWNUM <= 100 order by table_bytes DESC"
[[metric]]
context = "size_dba_segments_top100"
metricsdesc = {table_partition_bytes="Gauge metric with the size of the table partition in user segments."}
labels = ["segment_name"]
request = "select * from (select segment_name,sum(bytes) as table_partition_bytes from dba_segments where segment_type='TABLE PARTITION' group by segment_name) where ROWNUM <= 100 order by table_partition_bytes DESC"
[[metric]]
context = "size_dba_segments_top100"
metricsdesc = {cluster_bytes="Gauge metric with the size of the cluster in user segments."}
labels = ["segment_name"]
request = "select * from (select segment_name,sum(bytes) as cluster_bytes from dba_segments where segment_type='CLUSTER' group by segment_name) where ROWNUM <= 100 order by cluster_bytes DESC"
[[metric]]
context = "cache_hit_ratio"
metricsdesc = {percentage="Gauge metric with the cache hit ratio."}
request = "select Round(((Sum(Decode(a.name, 'consistent gets', a.value, 0)) + Sum(Decode(a.name, 'db block gets', a.value, 0)) - Sum(Decode(a.name, 'physical reads', a.value, 0)) )/ (Sum(Decode(a.name, 'consistent gets', a.value, 0)) + Sum(Decode(a.name, 'db block gets', a.value, 0)))) *100,2) as percentage FROM v$sysstat a"
[[metric]]
context = "startup"
metricsdesc = {time_seconds="Database startup time in seconds."}
request = "SELECT (SYSDATE - STARTUP_TIME) * 24 * 60 * 60 AS time_seconds FROM V$INSTANCE"
[[metric]]
context = "lock"
metricsdesc = {cnt="The number of locked database objects."}
request = "SELECT COUNT(*) AS cnt FROM ALL_OBJECTS A, V$LOCKED_OBJECT B, SYS.GV_$SESSION C WHERE A.OBJECT_ID = B.OBJECT_ID AND B.PROCESS = C.PROCESS"
[[metric]]
context = "archivelog"
metricsdesc = {count="Number of archived logs that have not been deleted."}
request="select COUNT(*) as count from v$archived_log where DELETED = 'NO'"
kind: ConfigMap
metadata:
name: custom-metrics
Configure blackbox_exporter
The Prometheus configuration above already forwards Service and Ingress probes to blackbox_exporter. All of my domains are served over HTTPS, so I hard-coded the scheme as shown below; to probe a particular Ingress or Service, add the annotation prometheus.io/probe: "true" to that object (an annotate example follows the snippets below).
- source_labels:
[__address__,__meta_kubernetes_ingress_path]
regex: (.+);(.+)
replacement: https://${1}${2}
If you serve HTTP instead, you can use the following configuration, which picks up the scheme set on your Ingress:
- source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
regex: (.+);(.+);(.+)
replacement: ${1}://${2}${3}
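To opt a Service or Ingress into probing, add the annotation to the object, for example (the names are placeholders):
kubectl -n <namespace> annotate ingress <ingress-name> prometheus.io/probe=true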
blackbox.yaml
---
apiVersion: v1
kind: Service
metadata:
labels:
app: blackbox-exporter
name: blackbox-exporter
spec:
ports:
- name: blackbox
port: 9115
protocol: TCP
selector:
app: blackbox-exporter
type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: blackbox-exporter
name: blackbox-exporter
spec:
replicas: 1
selector:
matchLabels:
app: blackbox-exporter
template:
metadata:
labels:
app: blackbox-exporter
spec:
containers:
- image: blackbox-exporter:latest
imagePullPolicy: IfNotPresent
name: blackbox-exporter
Error: User "system:serviceaccount:prometheus:prometheus-kube-state-metrics" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
# Edit ClusterRole/prometheus-kube-state-metrics
kubectl edit ClusterRole/prometheus-kube-state-metrics
# Add:
- apiGroups:
- storage.k8s.io
resources:
- storageclasses
verbs:
- list
- get
- watch
Error: User "system:serviceaccount:prometheus:prometheus-kube-state-metrics" cannot list resource "replicasets" in API group "apps" at the cluster scope
# Edit ClusterRole/prometheus-kube-state-metrics
kubectl edit ClusterRole/prometheus-kube-state-metrics
# Add to the resources list under the apps apiGroup:
- replicasets
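After the edit, the apps rule ends up looking roughly like this (the other resources listed are typical kube-state-metrics defaults and may differ in your chart):
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs:
  - list
  - get
  - watch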