Node Problem Detector
Node Problem Detector (NPD) is an open-source node monitoring plugin for Kubernetes clusters, used to detect node faults.
NPD functions:
- Generate events and report them to the API server.
- Collect indicator data and expose it as metrics.
Installation Configuration
Preconditions
- Kubernetes is installed
- DataKit is installed
- Prometheus Operator is installed
Enable ServiceMonitor in DataKit
Reference: Automatically Discover the Service Exposure Metrics Interface
The steps below collect NPD metrics through a ServiceMonitor.
Install NPD
For other installation options, see the NPD installation documentation. Here, the YAML manifests are used for installation.
- Download the YAML files
- Modify node-problem-detector.yaml
    ...
    - name: node-problem-detector
      command:
      - /node-problem-detector
      - --logtostderr
      - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json
      - --address=0.0.0.0
      - --prometheus-address=0.0.0.0
    ...
      ports:
      - containerPort: 20257
        hostPort: 20257
        name: man-port
- Create npd-service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: node-problem-detector
      namespace: kube-system
      labels:
        app: node-problem-detector
    spec:
      selector:
        app: node-problem-detector
      ports:
      - protocol: TCP
        port: 20257
        targetPort: 20257
        name: metrics
- Create npd-server-monitor.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: npd-server-metrics
      labels:
        app: node-problem-detector
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: node-problem-detector
      endpoints:
      - port: metrics
        params:
          measurement:
          - node-problem-detector
- Run the following commands
    kubectl apply -f rbac.yaml
    kubectl apply -f node-problem-detector-config.yaml
    kubectl apply -f node-problem-detector.yaml
    kubectl apply -f npd-service.yaml
    kubectl apply -f npd-server-monitor.yaml
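After the manifests are applied, it can help to confirm that the resources exist. The following is a minimal sketch, not part of the official procedure: the function name `check_npd` is hypothetical, and it assumes the NPD Pods carry the `app=node-problem-detector` label that the Service above selects.

```shell
#!/bin/sh
# Namespace, label, and resource names are taken from the manifests above.
NPD_NAMESPACE="kube-system"
NPD_LABEL="app=node-problem-detector"

# check_npd (hypothetical helper): list the NPD Pods, the Service,
# and the ServiceMonitor so that missing resources are easy to spot.
check_npd() {
  kubectl -n "$NPD_NAMESPACE" get pods -l "$NPD_LABEL" -o wide
  kubectl -n "$NPD_NAMESPACE" get service node-problem-detector
  kubectl -n "$NPD_NAMESPACE" get servicemonitor npd-server-metrics
}

# To inspect the raw metrics by hand (assuming the default /metrics path):
#   kubectl -n kube-system port-forward service/node-problem-detector 20257:20257
#   curl -s http://127.0.0.1:20257/metrics
```

Calling `check_npd` from a shell with cluster access prints the three resources in turn; an error from any of the commands suggests that the corresponding manifest was not applied.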
Metric
problem_gauge
Tag | Description |
---|---|
type="ConntrackFullProblem" | Node connection tracking (conntrack) table is full |
type="EmptyDirVolumeGroupStatusError" | Temporary (emptyDir) volume storage pool failure |
type="MemoryProblem" | Node memory failure |
type="LocalPvVolumeGroupStatusError" | Persistent volume (local PV) storage pool failure on the node |
type="MountPointProblem" | Node mount point failure |
type="FDProblem" | File descriptors (FD), a critical system resource, are insufficient |
type="DiskHung" | Whether I/O on the node disk is hung |
type="DiskReadonly" | Whether the node disk is read-only |
type="DiskProblem" | Node system disk failure |
type="DiskSlow" | Whether I/O on the node disk is slow |
type="FrequentCRIRestart" | CRI restarts frequently |
type="FrequentDockerRestart" | Docker restarts frequently |
type="FrequentKubeletRestart" | Kubelet restarts frequently |
type="FrequentContainerdRestart" | Containerd restarts frequently |
type="NTPProblem" | ntpd synchronization is abnormal |
type="PIDProblem" | PIDs, a critical system resource, are insufficient |
type="ResolvConfFileProblem" | resolv.conf configuration is abnormal |
type="CNIProblem" | CNI (container network) component is abnormal |
type="CRIProblem" | CRI (container runtime) component, Docker or containerd, is running abnormally |
type="KUBEPROXYProblem" | kube-proxy is running abnormally |
type="KUBELETProblem" | Kubelet status is abnormal |
type="ScheduledEvent" | Whether the node has scheduled host events |
type="ProcessD" | Processes in the D state (uninterruptible sleep) exist on the node |
type="ProcessZ" | Processes in the Z state (zombie) exist on the node |
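
The table above can be applied to raw metrics output with a small filter. The following is a minimal sketch, assuming NPD exposes `problem_gauge` in the Prometheus text format with the `type` label listed above, that a value of 1 marks an active problem, and that samples carry no trailing timestamp. The function name `active_problems` and the sample lines (including the `reason` values) are hypothetical.

```shell
#!/bin/sh
# active_problems (hypothetical helper): read Prometheus text format on
# stdin and print the "type" label of every problem_gauge sample whose
# value is non-zero.
active_problems() {
  awk '
    index($0, "problem_gauge{") == 1 && $NF + 0 != 0 {
      # Extract the value of the type="..." label.
      if (match($0, /type="[^"]*"/)) {
        print substr($0, RSTART + 6, RLENGTH - 7)
      }
    }
  '
}

# Invented sample shaped like NPD output, for illustration only.
sample='# TYPE problem_gauge gauge
problem_gauge{reason="",type="DiskReadonly"} 0
problem_gauge{reason="NTPIsDown",type="NTPProblem"} 1
problem_gauge{reason="",type="PIDProblem"} 0'

printf '%s\n' "$sample" | active_problems
# prints: NTPProblem
```

Against a live node, the same function could be fed from `curl -s http://<node-ip>:20257/metrics`, where `<node-ip>` is a placeholder for a node address reachable on the hostPort configured earlier.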