Kubernetes Pod Abnormal Restart Inspection¶
Background¶
Kubernetes helps users automatically schedule and scale containerized applications, but modern Kubernetes environments are becoming increasingly complex. When platform and application engineers need to investigate events in dynamic, containerized environments, finding the most meaningful signals may involve many trial-and-error steps. Through intelligent inspection, anomalies can be filtered based on the current search context, thereby accelerating event investigation, reducing the pressure on engineers, decreasing mean time to repair, and improving end-user experience.
Prerequisites¶
- Enable "Container Data Collection" in Guance
- Self-host DataFlux Func Guance Special Edition or activate DataFlux Func (Automata)
- In Guance "Management / API Key Management," create an API Key for operations.
Start Inspection¶
In your self-hosted DataFlux Func, install "Guance Self-built Inspection (K8S-Pod Restart Detection)" via the "Script Market" and configure the Guance API Key to start it.
In the DataFlux Func Script Market, select the inspection scenario you want to enable, click Install, configure the Guance API Key and GuanceNode connection, then choose to deploy and start the script.
Once the script is deployed successfully, the startup script and its automatic trigger configuration are created for you. You can follow the link to jump straight to the corresponding configuration.
Configure Inspection¶
Configuration of Inspection in Guance¶
Enable/Disable¶
Intelligent inspection is disabled by default and can be manually enabled. After enabling, it will begin inspecting Pods in the configured Kubernetes clusters.
Edit¶
The "Kubernetes Pod Abnormal Restart Inspection" supports users adding filter conditions manually. Under the operation menu on the right side of the intelligent inspection list, click the Edit button to edit the inspection template.
- Filter Conditions: Configure the cluster_name (cluster name; optional, and if omitted the inspection matches on namespace alone) and the namespace (namespace; required) of the Kubernetes environment to be inspected.
- Alert Notifications: Supports selecting and editing alert strategies, including event severity levels, notification targets, and alert mute periods.
To configure entry parameters, click Edit, fill in the corresponding detection objects, and save to start the inspection:
You can refer to the following JSON configuration for multiple clusters and namespace information:
// Configuration example: namespace can be a single value or a list
configs = [
    {"cluster_name": "xxx", "namespace": ["xxx1", "xxx2"]},
    {"cluster_name": "yyy", "namespace": "yyy1"}
]
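Because `namespace` may be either a single string or a list, it can help to see both shapes reduced to one form. The following is a minimal sketch and not part of the Guance script set: `expand_configs` is a hypothetical helper that flattens each entry into `(cluster_name, namespace)` pairs and rejects entries without the required `namespace`.

```python
from typing import Iterable, List, Optional, Tuple


def expand_configs(configs: Iterable[dict]) -> List[Tuple[Optional[str], str]]:
    """Hypothetical helper: flatten configs into (cluster_name, namespace) pairs."""
    pairs = []
    for entry in configs:
        namespace = entry.get("namespace")
        if not namespace:
            raise ValueError("namespace is required in every config entry")
        # Accept either a single string or a list of strings.
        namespaces = [namespace] if isinstance(namespace, str) else list(namespace)
        cluster = entry.get("cluster_name")  # optional, may be None
        for ns in namespaces:
            pairs.append((cluster, ns))
    return pairs


configs = [
    {"cluster_name": "xxx", "namespace": ["xxx1", "xxx2"]},
    {"cluster_name": "yyy", "namespace": "yyy1"},
]
pairs = expand_configs(configs)
print(pairs)
```

Each pair corresponds to one cluster and namespace combination that the configuration asks the inspection to cover.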
Note: In your self-hosted DataFlux Func, when writing custom inspection processing functions, you can also add filter conditions (refer to the sample code configuration). However, note that parameters configured in Guance Studio will override those set in the custom inspection processing function.
Configuration of Inspection in DataFlux Func¶
In DataFlux Func, after configuring the necessary filter conditions for inspection, you can test by clicking the run() method directly on the page. After publishing, the script will execute normally. You can also view or modify configurations in Guance "Monitoring / Intelligent Inspection."
from guance_monitor__runner import Runner
from guance_monitor__register import self_hosted_monitor
import guance_monitor_k8s_pod_restart__main as k8s_pod_restart

# Placeholder credentials (assumption: the original snippet references `account`
# without defining it). Replace with your Guance API Key ID and API Key.
account = {
    'api_key_id': '<your API Key ID>',
    'api_key': '<your API Key>',
}

# Filter functions can be used to narrow down the objects being inspected, for example:
def filter_namespace(cluster_namespaces):
    '''
    Filter namespaces: define the condition a cluster/namespace entry must meet.
    Return True for a match and False otherwise.
    '''
    cluster_name = cluster_namespaces.get('cluster_name', '')
    namespace = cluster_namespaces.get('namespace', '')  # available for namespace-based conditions
    if cluster_name in ['xxxx']:
        return True
    return False

# DFF is provided by the DataFlux Func runtime and does not need to be imported.
@self_hosted_monitor(account['api_key_id'], account['api_key'])
@DFF.API('K8S-Pod Restart Detection Inspection', fixed_crontab='*/30 * * * *', timeout=900)
def run(configs=None):
    """
    Optional parameters:
        configs: (if not configured, it will check all by default; follow the content below if configured)
            Configure the cluster_name (cluster name, optional, if not configured, it will check based on namespace)
            Configure the namespace (namespace, required)
        Example: namespace can be configured as multiple or single
            configs = [
                {"cluster_name": "xxx", "namespace": ["xxx1", "xxx2"]},
                {"cluster_name": "yyy", "namespace": "yyy1"}
            ]
    """
    checkers = [
        # Multiple user-defined filter functions are supported and are executed sequentially.
        k8s_pod_restart.K8SPodRestartCheck(configs=configs, filters=[filter_namespace]),
    ]

    Runner(checkers, debug=False).run()
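The script above passes a list of filter functions that are executed sequentially. How the script set combines their results internally is not documented here, so the following is only a sketch under the assumption that an object is inspected when every filter returns True. `apply_filters`, `in_known_cluster`, and `not_system_namespace` are hypothetical names used for illustration.

```python
from typing import Callable, Dict, Iterable, List

Filter = Callable[[Dict[str, str]], bool]


def apply_filters(candidates: Iterable[Dict[str, str]], filters: List[Filter]) -> List[Dict[str, str]]:
    """Hypothetical: keep candidates that every filter accepts, checking filters in order."""
    kept = []
    for candidate in candidates:
        # all() short-circuits, so later filters are skipped after the first rejection.
        if all(f(candidate) for f in filters):
            kept.append(candidate)
    return kept


def in_known_cluster(item: Dict[str, str]) -> bool:
    return item.get("cluster_name", "") in ["xxxx"]


def not_system_namespace(item: Dict[str, str]) -> bool:
    return item.get("namespace", "") != "kube-system"


candidates = [
    {"cluster_name": "xxxx", "namespace": "default"},
    {"cluster_name": "xxxx", "namespace": "kube-system"},
    {"cluster_name": "other", "namespace": "default"},
]
selected = apply_filters(candidates, [in_known_cluster, not_system_namespace])
print(selected)
```

Under this assumption, adding a filter can only narrow the set of inspected objects, which is why each filter should return an explicit True or False.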
View Events¶
Based on the inspection algorithm of Guance, Intelligent Inspection will look for abnormal Pod restarts within the currently configured clusters. For abnormal situations, Intelligent Inspection will generate corresponding events. Under the operation menu on the right side of the Intelligent Inspection list, click the View Related Events button to see the corresponding abnormal events.
Event Details Page¶
By clicking Event, you can view the details page of the Intelligent Inspection event, including event status, the time of the anomaly occurrence, the anomaly name, basic attributes, event details, alert notifications, history records, and related events.
- Click the small "View Monitor Configuration" icon in the top-right corner of the details page to view and edit the detailed configuration of the current Intelligent Inspection.
Basic Attributes¶
- Detection Dimensions: Based on the filter conditions configured in Intelligent Inspection, you can copy key/value pairs, add them to filters, and view related logs, containers, processes, security checks, traces, user analysis, synthetic tests, and CI data.
- Extended Attributes: After selecting extended attributes, you can copy them in key/value form and perform forward/reverse filtering.
Event Details¶
- Event Overview: Describes the object and content of the abnormal inspection event.
- Abnormal Pods: You can view the status of abnormal pods under the current namespace.
- Container Status: You can view detailed error times, container ID status, current resource conditions, and container types; clicking on the container ID will take you to the specific container detail page.
History Records¶
Supports viewing detected objects, anomaly/recovery times, and duration.
Related Events¶
Supports viewing associated events by filtering on fields and on the time range selected in the time component.
Kubernetes Metrics¶
Through the Kubernetes monitoring view in the event, you can see more granular information about corresponding anomalies and possible influencing factors.
Common Issues¶
1. How to configure the detection frequency of Kubernetes Pod Abnormal Restart Inspection
- In your self-hosted DataFlux Func, when writing custom inspection processing functions, add fixed_crontab='*/30 * * * *', timeout=900 to the decorator, then configure it in "Management / Automatic Trigger Configuration."
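In standard cron syntax, the expression `*/30 * * * *` fires at minutes 0 and 30 of every hour. The sketch below illustrates that reading; it is not how DataFlux Func schedules tasks internally, and `matching_minutes` and `next_run` are hypothetical helpers that only interpret the minute field and assume the other four fields are `*`.

```python
from datetime import datetime, timedelta
from typing import List


def matching_minutes(minute_field: str) -> List[int]:
    """Expand a cron minute field of the form '*', '*/N', or a single number."""
    if minute_field == "*":
        return list(range(60))
    if minute_field.startswith("*/"):
        step = int(minute_field[2:])
        return list(range(0, 60, step))
    return [int(minute_field)]


def next_run(after: datetime, crontab: str) -> datetime:
    """Next run strictly after `after`, assuming the other four fields are '*'."""
    minutes = matching_minutes(crontab.split()[0])
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while candidate.minute not in minutes:
        candidate += timedelta(minutes=1)
    return candidate


print(matching_minutes("*/30"))
print(next_run(datetime(2024, 1, 1, 10, 7), "*/30 * * * *"))
```

Changing the step, for example to `*/10`, would make the same helper report runs every ten minutes, which is a quick way to sanity-check an expression before saving it.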
2. Why might there be no anomaly analysis when Kubernetes Pod Abnormal Restart Inspection triggers
If there is no anomaly analysis in the inspection report, please check the data collection status of the current DataKit.
3. Under what circumstances does a Kubernetes Pod Abnormal Restart Inspection event occur
The inspection uses the ratio of restarted Pods within each cluster_name + namespace combination as its entry point. When this ratio rises during the last 30 minutes, the event generation logic is triggered and root cause analysis is performed.
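The trigger condition can be pictured with a small calculation. This is a sketch under stated assumptions, not the Guance algorithm: the exact ratio definition, comparison window, and any threshold are not documented here, so `restart_ratio`, `ratio_rose`, and the sample counts below are hypothetical.

```python
from typing import Dict, List, Tuple


def restart_ratio(restarted_pods: int, total_pods: int) -> float:
    """Share of Pods that restarted; 0.0 when the namespace has no Pods."""
    return restarted_pods / total_pods if total_pods else 0.0


def ratio_rose(previous: float, current: float, min_increase: float = 0.0) -> bool:
    """Assumed rule: trigger when the ratio grew by more than `min_increase`."""
    return current - previous > min_increase


# Made-up (restarted, total) counts per (cluster_name, namespace),
# for the preceding window and for the last 30 minutes.
windows: Dict[Tuple[str, str], Dict[str, Tuple[int, int]]] = {
    ("xxx", "xxx1"): {"previous": (1, 20), "current": (5, 20)},
    ("yyy", "yyy1"): {"previous": (2, 10), "current": (2, 10)},
}

triggered: List[Tuple[str, str]] = [
    key
    for key, w in windows.items()
    if ratio_rose(restart_ratio(*w["previous"]), restart_ratio(*w["current"]))
]
print(triggered)
```

Grouping by cluster and namespace keeps a noisy namespace from masking a rise in a quieter one.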
4. What to do when previously normal scripts encounter errors during inspection
Update the referenced script set in the DataFlux Func Script Market. You can check the update records of the Script Market through the Change Log to facilitate timely updates of scripts.
5. Why does the corresponding script set in Startup remain unchanged during script upgrade
Delete the corresponding script set first, then click the Upgrade button and reconfigure the Guance API Key to complete the upgrade.
6. How to determine if the inspection has taken effect after activation
In "Management / Automatic Trigger Configuration," check the status of the corresponding inspection. First, confirm that the status is enabled. Second, verify the inspection script by clicking Execute. If the message "Executed successfully xxx minutes ago" appears, the inspection is running normally.