Unrecovered Incidents


The Unrecovered Incidents Explorer displays, in one place, every alert-level incident record in the current workspace. By associating each incident with its monitor and alert strategy, it gives you the full context of an alert, speeds up understanding and identification of incidents, and helps reduce alert fatigue.

The Unrecovered Incidents data source queries incident data, aggregates it by df_fault_id as the unique identifier, and displays the most recent result for each incident. Each incident shows its severity level, duration, alert notification status, monitor, incident content, and a historical trigger trend chart with the configured threshold. Together these form a single view that lets you analyze an incident from several angles and make better-informed response decisions.
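The aggregation described above can be pictured with a minimal Python sketch. It is only an illustration, not the product's implementation: the field name `time` (an epoch-millisecond timestamp) and the function name `latest_per_fault` are assumptions, while `df_fault_id` and `df_status` are taken from this page.

```python
from typing import Iterable


def latest_per_fault(records: Iterable[dict]) -> list[dict]:
    """Keep only the most recent record for each df_fault_id."""
    latest: dict[str, dict] = {}
    for rec in records:
        fault_id = rec["df_fault_id"]
        # "time" is an assumed timestamp field (epoch milliseconds).
        if fault_id not in latest or rec["time"] > latest[fault_id]["time"]:
            latest[fault_id] = rec
    # Newest incidents first; the ordering is an illustrative choice.
    return sorted(latest.values(), key=lambda r: r["time"], reverse=True)


records = [
    {"df_fault_id": "a", "time": 1000, "df_status": "warning"},
    {"df_fault_id": "a", "time": 2000, "df_status": "critical"},
    {"df_fault_id": "b", "time": 1500, "df_status": "error"},
]
result = latest_per_fault(records)
```

In this sample, incident `a` is represented by its later `critical` record, which matches the rule that the Explorer shows the level at the most recent trigger.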

Incident Card

Incident Severity Level

Based on the trigger condition configuration of the monitor, the following status statistics are generated:

  • Fatal (fatal)
  • Critical (critical)
  • Error (error)
  • Warning (warning)
  • No Data (nodata)

In the Unrecovered Incidents Explorer, the severity level of each incident is defined as the level at the most recent trigger time for that detected object.

For more details, refer to Incident Severity Level Description.

Incident Title

The incident title comes directly from the title set in the monitor rule configuration. The Explorer shows the title used at the most recent trigger of the incident for that detected object.

Duration

The time elapsed from the first abnormal trigger that generated the incident for the current detected object to the end of the range selected in the time widget, e.g., 5 minutes (08/20 17:53:00 ~ 17:57:38).
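A minimal sketch of how such a duration label could be derived. Rounding to the nearest minute is an assumption chosen because it reproduces the example above; the function name `duration_label` and the year used in the sample timestamps are hypothetical.

```python
from datetime import datetime


def duration_label(first_trigger: datetime, window_end: datetime) -> str:
    """Format the time from the first abnormal trigger to the window end."""
    seconds = (window_end - first_trigger).total_seconds()
    # Assumption: round to the nearest minute, with a floor of one minute.
    minutes = max(1, round(seconds / 60))
    return (
        f"{minutes} minutes "
        f"({first_trigger:%m/%d %H:%M:%S} ~ {window_end:%H:%M:%S})"
    )


# The year is arbitrary; the source example only gives month, day, and time.
label = duration_label(
    datetime(2024, 8, 20, 17, 53, 0),
    datetime(2024, 8, 20, 17, 57, 38),
)
```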

Alert Notification

The alert notification status for the most recent trigger of the incident for the current detected object. It primarily includes the following three states:

  • Mute: The current incident is covered by a mute rule, so no external alert notification was sent.
  • Identifier of the actual notified target: Includes DingTalk bot, WeCom bot, Lark bot, etc.
  • -: No external alert notification was triggered.
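The three states above could be derived as in the following sketch. The inputs `muted` and `notified_targets` are hypothetical names for illustration; how the product actually stores this information is not stated here.

```python
def notification_label(muted: bool, notified_targets: list[str]) -> str:
    """Return the alert notification state shown on an incident card."""
    if muted:
        # A mute rule applied, so nothing was sent externally.
        return "Mute"
    if notified_targets:
        # Show the identifiers of the targets that were actually notified.
        return ", ".join(notified_targets)
    # No external alert notification was triggered.
    return "-"


examples = [
    notification_label(True, []),
    notification_label(False, ["DingTalk bot", "Lark bot"]),
    notification_label(False, []),
]
```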

Monitor Detection Type

Refers to the monitor type.

Detected Object

If the detection metric in the monitor rule configuration uses a by grouping query, the incident card displays the resulting filter condition, e.g., source:kodo-servicemap.

Incident Content

The incident content comes from the content preset in the monitor rule configuration. The card shows the content generated at the most recent trigger of the incident for the current detected object.

Historical Trigger Trend Chart

The trend is rendered with a window function and shows the actual detection result values from the last 60 detections.

The chart plots the historical trend of anomalies based on the detection result values of the unrecovered incident, with the trigger threshold configured in the monitor detection rule drawn as a reference line. The detection result from the most recent trigger for the current detected object is specifically marked, and vertical lines in the chart pinpoint the exact trigger time. The detection interval corresponding to that result is also displayed, so you can assess how the incident developed and what impact it had.
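As a rough illustration of the data behind such a chart, the sketch below keeps the last 60 detection values and flags the ones that reach a threshold. The names `build_trend` and `above_threshold` are hypothetical, and the `>=` comparison is an assumption; a real monitor may use a different operator.

```python
from collections import deque
from typing import Iterable

WINDOW_SIZE = 60  # the chart shows the last 60 detections


def build_trend(values: Iterable[float], threshold: float) -> list[dict]:
    """Keep the most recent WINDOW_SIZE values and mark threshold breaches."""
    # A bounded deque silently discards older values as new ones arrive.
    window = deque(values, maxlen=WINDOW_SIZE)
    return [
        # Assumption: a detection "triggers" when value >= threshold.
        {"value": v, "above_threshold": v >= threshold}
        for v in window
    ]


trend = build_trend(range(100), threshold=95)
breaches = [point["value"] for point in trend if point["above_threshold"]]
```

The threshold plays the role of the reference line, and the flagged points correspond to the detections that a chart would mark as triggers.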

Management Card

Display Items

The Unrecovered Incidents list supports the following display styles:

  • Standard: Displays the incident title, detection dimension, and incident content.
  • Expanded: In addition to standard information, also shows the historical trend of the detection result values for the unrecovered incident.
  • List: Displays incident data in a list format.

View Only Incidents Associated with Issues

Check this option to show only the incidents in the current list that are associated with an Issue.

For a single incident that has an associated Issue, click the icon on the right side of the incident data to jump directly to that Issue.

Issue & Create New Issue

Create an Issue for an unrecovered incident to notify relevant members for timely handling. An Issue can be created from:

  • List mode
  • Standard/Expanded mode
  • Incident details

Mute Incident

In large-scale monitoring scenarios, handling a large number of similar alerts by hand is tedious, time-consuming, and prone to omissions. To avoid this, you can mute the rule directly on the current page.

  1. Hover over a single incident and click Mute on the right side.
  2. Select the Mute time type.
  3. Confirm.

Mute Time Type

You can set a custom start time and end time for muting, or pick a quick preset of 1 hour, 6 hours, 12 hours, 1 day, or 1 week.


  1. Select the start time and duration for muting.
  2. Select the mute cycle starting from a specific moment.
  3. Select the expiration time for the mute. You can choose to repeat forever according to the above time or repeat until a specific moment.
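The one-off mute options (custom start and end, or a quick preset) can be sketched as follows. This is an illustration under stated assumptions: the names `QUICK_MUTE_PRESETS`, `mute_window`, and `is_muted` are hypothetical, the window is treated as half-open (the end moment is not muted), and repeating mute cycles are not modeled.

```python
from datetime import datetime, timedelta
from typing import Optional

# Quick presets listed on this page, mapped to durations.
QUICK_MUTE_PRESETS = {
    "1 hour": timedelta(hours=1),
    "6 hours": timedelta(hours=6),
    "12 hours": timedelta(hours=12),
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
}


def mute_window(
    start: datetime,
    preset: Optional[str] = None,
    end: Optional[datetime] = None,
) -> tuple[datetime, datetime]:
    """Return (start, end) from a quick preset or a custom end time."""
    if preset is not None:
        return start, start + QUICK_MUTE_PRESETS[preset]
    if end is None or end <= start:
        raise ValueError("a custom mute needs an end time after the start")
    return start, end


def is_muted(now: datetime, window: tuple[datetime, datetime]) -> bool:
    """Assumption: the window is half-open, [start, end)."""
    start, end = window
    return start <= now < end


window = mute_window(datetime(2024, 8, 20, 18, 0, 0), preset="6 hours")
```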

Recover Incident

An incident is considered recovered when its status is normal (df_sub_status = ok).

  • To recover a single rule, use the button on its right side, go to the Monitor settings, or recover it manually.
  • If you click "Recover All", all abnormal incidents in the current list will be recovered, and you can choose whether to associate Issues.

Incident recovery is divided into four types:

| Name | df_status | Description |
| --- | --- | --- |
| Recover | ok | If the previously detected "Critical", "Error", or "Warning" abnormal incidents are not triggered again within N subsequent detections, the incident is considered recovered. |
| No Data Recovered | ok | Data stopped being reported and then resumed reporting, so the incident is judged as recovered. |
| No Data Treated as Recovered | ok | Detection data interruption is treated as a normal state. |
| Manual Recovery | ok | The user manually clicks to recover; single and batch recovery are supported. |
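The first recovery type ("not triggered again within N subsequent detections") can be sketched as below. This is a simplified reading of the rule, not the product's logic: the function name `recovered_after_n` is hypothetical, and detections are modeled as a plain list of status strings ordered from oldest to newest.

```python
# Abnormal levels named in the "Recover" row of the table.
ABNORMAL = {"critical", "error", "warning"}


def recovered_after_n(statuses: list[str], n: int) -> bool:
    """True if at least n detections since the last abnormal one were clean."""
    last_abnormal = max(
        (i for i, status in enumerate(statuses) if status in ABNORMAL),
        default=None,
    )
    if last_abnormal is None:
        # No abnormal detection at all, so there is nothing to recover from.
        return False
    detections_since = len(statuses) - 1 - last_abnormal
    return detections_since >= n


quiet = recovered_after_n(["ok", "critical", "ok", "ok", "ok"], n=3)
retriggered = recovered_after_n(["critical", "ok", "warning", "ok"], n=3)
```

In the second sample, the `warning` resets the count, so only one clean detection has passed and the incident is not yet considered recovered.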
