Node Problem Detector¶

Node Problem Detector (NPD) is an open-source Kubernetes add-on that monitors cluster nodes and checks them for failures.

NPD Features:

Generates events and reports them to the API server.
Detects problem indicators and outputs them as metrics.

Configuration¶

Prerequisites¶

Install a Kubernetes environment
Install DataKit

Install NPD¶

You can refer to the installation documentation for details. Here we install NPD by applying YAML manifests:

kubectl apply -f rbac.yaml
kubectl apply -f node-problem-detector-config.yaml
kubectl apply -f node-problem-detector.yaml
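
After applying the manifests, it can help to confirm that the DaemonSet has finished rolling out before moving on. The following is a minimal sketch that wraps `kubectl rollout status`. The function name `npd_wait_ready` is hypothetical, and the DaemonSet name `node-problem-detector` and namespace `kube-system` are assumptions, so adjust them to match your manifests.

```shell
#!/bin/sh
# Wait until the NPD DaemonSet reports ready on every scheduled node.
# Assumed defaults: DaemonSet "node-problem-detector" in namespace "kube-system".
npd_wait_ready() {
  name="${1:-node-problem-detector}"
  namespace="${2:-kube-system}"
  kubectl rollout status "daemonset/${name}" -n "${namespace}" --timeout=120s
}

# Usage (requires access to the cluster):
#   npd_wait_ready
#   npd_wait_ready my-npd my-namespace
```

A non-zero exit status means the rollout did not complete within the timeout, which makes the function usable as a gate in a setup script.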

Log Event Pattern¶

No additional configuration is required after installing NPD. DataKit collects Kubernetes events by default and writes them as logs with the data source Kubernetes_events; the event reason is stored in the reason tag.
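
To check on the cluster side that node events are being produced before looking for them in DataKit, a sketch like the one below may be useful. It lists events attached to Node objects and can narrow them to a single reason. The function name `node_events` is hypothetical, and the result also includes node events that do not come from NPD.

```shell
#!/bin/sh
# List events whose involved object is a Node.
# Optional first argument: an event reason to filter on, for example OOMKilling.
node_events() {
  selector="involvedObject.kind=Node"
  if [ -n "${1:-}" ]; then
    selector="${selector},reason=$1"
  fi
  kubectl get events --all-namespaces --field-selector "${selector}"
}

# Usage (requires access to the cluster):
#   node_events
#   node_events OOMKilling
```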

Metric Pattern¶

In addition to the event pattern, NPD also supports outputting metrics.

Preparations¶

Install Prometheus Operator

Enable `ServiceMonitor` in DataKit¶

Automatically discover Pod/Service Prometheus metrics

The following steps collect NPD metrics through a ServiceMonitor.

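Modify node-problem-detector.yaml so that the Prometheus endpoint listens on all interfaces and container port 20257 is exposed on the host: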

...
      - name: node-problem-detector
        command:
        - /node-problem-detector
        - --logtostderr
        - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json
        - --address=0.0.0.0
        - --prometheus-address=0.0.0.0
...
        ports:
        - containerPort: 20257
          hostPort: 20257
          name: man-port

Create npd-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    app: node-problem-detector
spec:
  selector:
    app: node-problem-detector
  ports:
    - protocol: TCP
      port: 20257
      targetPort: 20257
      name: metrics

Create npd-server-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: npd-server-metrics
  labels:
    app: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  endpoints:
  - port: metrics
    params:
      measurement:
        - node-problem-detector

Apply all manifests:

kubectl apply -f rbac.yaml
kubectl apply -f node-problem-detector-config.yaml
kubectl apply -f node-problem-detector.yaml
kubectl apply -f npd-service.yaml
kubectl apply -f npd-server-monitor.yaml
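
Once everything is applied, one way to confirm that the Service and the ServiceMonitor exist is to query them by the `app: node-problem-detector` label that both manifests above carry. This is a sketch only: the function name `npd_check_monitoring` is hypothetical, and the `servicemonitor` resource type is available only when the Prometheus Operator CRDs are installed.

```shell
#!/bin/sh
# Show the NPD Service and ServiceMonitor created by the manifests above.
# Optional first argument: namespace (defaults to kube-system, as in the manifests).
npd_check_monitoring() {
  namespace="${1:-kube-system}"
  kubectl get service,servicemonitor -n "${namespace}" -l app=node-problem-detector
}

# Usage (requires access to the cluster):
#   npd_check_monitoring
```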

Metrics¶

problem_gauge¶

Tag	Description
type="ConntrackFullProblem"	Node connection tracking table issue
type="EmptyDirVolumeGroupStatusError"	Temporary volume storage pool issue
type="MemoryProblem"	Node memory issue
type="LocalPvVolumeGroupStatusError"	Persistent volume storage pool issue on the node
type="MountPointProblem"	Node mount point issue
type="FDProblem"	File descriptor (FD) count issue, a critical system resource
type="DiskHung"	Whether the node disk has IO hang
type="DiskReadonly"	Whether the node disk is read-only
type="DiskProblem"	Node system disk issue
type="DiskSlow"	Whether the node disk has slow IO issue
type="FrequentCRIRestart"	CRI frequent restarts
type="FrequentDockerRestart"	Docker frequent restarts
type="FrequentKubeletRestart"	Kubelet frequent restarts
type="FrequentContainerdRestart"	Containerd frequent restarts
type="NTPProblem"	`ntpd` synchronization anomaly
type="PIDProblem"	Insufficient PIDs (process IDs), a critical system resource
type="ResolvConfFileProblem"	`resolv.conf` configuration anomaly
type="CNIProblem"	CNI (container network) component anomaly
type="CRIProblem"	CRI (container runtime) component, Docker or Containerd, operational state anomaly
type="KUBEPROXYProblem"	Kube-proxy operational anomaly
type="KUBELETProblem"	Kubelet status anomaly
type="ScheduledEvent"	Whether there are scheduled events on the host
type="ProcessD"	Node has processes in the D (uninterruptible sleep) state
type="ProcessZ"	Node has processes in the Z (zombie) state
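
To inspect these metrics directly on a node, the sketch below fetches the Prometheus text from the host port configured earlier (20257) and keeps only the `problem_gauge` samples whose value is 1, on the assumption that 1 means the problem is currently present. The function names `npd_metrics` and `active_problems` are hypothetical, and the filter assumes each sample line ends with its value (no trailing timestamp).

```shell
#!/bin/sh
# Fetch the raw Prometheus exposition text from one node.
# $1: node IP or hostname; port 20257 is the hostPort set in the DaemonSet.
npd_metrics() {
  curl -s "http://$1:20257/metrics"
}

# Read exposition text on stdin and print only problem_gauge samples
# whose last field (the sample value) equals 1.
active_problems() {
  awk 'index($0, "problem_gauge{") == 1 && $NF == 1 { print }'
}

# Usage (the IP address is a placeholder):
#   npd_metrics 10.0.0.5 | active_problems
```

Keeping the fetch and the filter separate lets the filter be reused on saved metric dumps.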

Logs¶

The following table lists some of the events that NPD can detect under the default configuration; it is not exhaustive. Events are written to logs with the data source Kubernetes_events:

Reason	Persistence	Description
`DockerHung`	Yes	Docker hung or unresponsive
`ReadonlyFilesystem`	Yes	File system mounted read-only, usually a protective measure that prevents further file system corruption
`CorruptDockerOverlay2`	Yes	Issue with Overlay2 storage driver
`ContainerdUnhealthy`	Yes	Containerd in unhealthy state
`KubeletUnhealthy`	Yes	Kubelet in unhealthy state
`DockerUnhealthy`	Yes	Docker in unhealthy state
`OOMKilling`	No	A process was killed because of an out-of-memory (OOM) condition
`TaskHung`	No	Task hung
`UnregisterNetDevice`	No	Network interface abnormality
`KernelOops`	No	Kernel detects abnormal behavior, such as null pointer, device error
`Ext4Error`	No	Ext4 file system issue
`Ext4Warning`	No	Ext4 file system issue
`IOError`	No	Buffer I/O error
`MemoryReadError`	No	Correctable memory error, frequent occurrences indicate potential hardware issues
`KubeletStart`	No	Kubelet started; frequent occurrences mean Kubelet is restarting often
`DockerStart`	No	Docker started; frequent occurrences mean Docker is restarting often
`ContainerdStart`	No	Containerd started; frequent occurrences mean Containerd is restarting often
`CorruptDockerImage`	No	Directory used by Docker registry is not empty
`DockerContainerStartupFailure`	No	Docker container fails to start
`ConntrackFull`	No	Connection tracking limit reached, which affects network functions such as NAT and firewalling
`NTPIsDown`	No	NTP time synchronization anomaly