# Threshold Detection
Threshold detection monitors anomalies in metrics, logs, infrastructure, the resource catalog, events, APM, RUM, and other data. Rules can set thresholds; when a threshold is exceeded, the system triggers an alert and notifies the relevant personnel. Multi-metric detection is also supported, with a different alert level configurable for each metric. A typical example is monitoring whether host memory usage is abnormally high.
## Detection Configuration
### Detection Frequency
The execution frequency of the detection rule; the default is 5 minutes.
In addition to the options provided by the system, you can enter a custom crontab expression to schedule the task by second, minute, hour, day, month, and day of week.

- When a custom crontab detection frequency is used, the available detection intervals are: last 1 minute, last 5 minutes, last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, and last 24 hours.
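For illustration, assuming the scheduler accepts the common seconds-first, six-field crontab layout (the exact field order here is an assumption, not confirmed by this page), a rule that runs every 10 minutes could be written as:

```
# sec  min   hour  day  month  weekday   (assumed seconds-first layout)
  0    */10  *     *    *      *         # run the detection every 10 minutes
```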
### Detection Interval
The time range for querying the detection metrics. The available detection intervals vary depending on the detection frequency.
| Detection Frequency | Detection Interval (Dropdown Options) |
|---|---|
| 30s | 1m/5m/15m/30m/1h/3h |
| 1m | 1m/5m/15m/30m/1h/3h |
| 5m | 5m/15m/30m/1h/3h |
| 15m | 15m/30m/1h/3h/6h |
| 30m | 30m/1h/3h/6h |
| 1h | 1h/3h/6h/12h/24h |
| 6h | 6h/12h/24h |
| 12h | 12h/24h |
| 24h | 24h |
### Detection Metrics
- Data Type: The type of data being detected, including Metrics, Logs, Infrastructure, Resource Catalog, Events, APM, RUM, and Network data types.
- Aggregation Algorithms: Provides various aggregation algorithms, including:
    - Avg by (average)
    - Min by (minimum)
    - Max by (maximum)
    - Sum by (sum)
    - Last (last value)
    - First by (first value)
    - Count by (data points)
    - Count_distinct by (unique data points)
    - p50 (median)
    - p75 (75th percentile)
    - p90 (90th percentile)
    - p99 (99th percentile)
    - ...
Note

When selecting the transformation functions `derivative`, `difference`, `non_negative_derivative`, or `non_negative_difference`, an `interval` must be added, e.g. `[::5m]`.

- Detection Dimensions: You can configure string-type (`keyword`) fields in the data as detection dimensions, with up to three fields supported. Combining multiple detection dimension fields identifies a specific detection object. The system determines whether the statistical metrics of these objects meet the trigger conditions and, if so, generates events.
    - For example, with the detection dimensions `host` and `host_ip`, a detection object can be represented as `{host: host1, host_ip: 127.0.0.1}`. When the data type is Logs, the default detection dimensions are `status`, `host`, `service`, `source`, and `filename`.
- Filter Conditions: Filter the detection data by metric tags to limit the detection scope. One or more tag filter conditions can be added; fuzzy matching and negative matching are supported for non-metric data.
- Alias: Customize the detection metric name.
- Query Methods: Supports simple queries, expression queries, PromQL queries, and data source queries.
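To illustrate how detection dimensions partition query results into detection objects, here is a minimal sketch with hypothetical data (the field names and threshold are illustrative, not the platform's implementation):

```python
from collections import defaultdict

# Hypothetical query rows; on the platform these come from the configured query.
rows = [
    {"host": "host1", "host_ip": "127.0.0.1", "mem_usage": 91.0},
    {"host": "host1", "host_ip": "127.0.0.1", "mem_usage": 95.5},
    {"host": "host2", "host_ip": "10.0.0.2",  "mem_usage": 40.2},
]

dimensions = ("host", "host_ip")  # up to three fields are supported
threshold = 90.0                  # example trigger condition: avg mem_usage > 90

# Each distinct combination of dimension values is one detection object.
groups = defaultdict(list)
for row in rows:
    groups[tuple(row[d] for d in dimensions)].append(row["mem_usage"])

# Aggregate per detection object and check the trigger condition.
for key, values in groups.items():
    avg = sum(values) / len(values)
    detection_object = dict(zip(dimensions, key))
    if avg > threshold:
        print("event:", detection_object, round(avg, 2))
```

Here `{host: host1, host_ip: 127.0.0.1}` aggregates to 93.25 and generates an event, while `host2` stays below the threshold.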
### Cross-Workspace Query Metrics
After authorization, you can select detection metrics from other workspaces under the current account. Once the monitor rule is created, cross-workspace alerting takes effect.
Note
After selecting another workspace, the detection metric dropdown options will only display data types that have been authorized in the current workspace.
## Trigger Conditions
Set the trigger conditions for each alert level: any of critical, error, warning, or normal can be configured.
Configure the trigger conditions and severity. When the query result contains multiple values, an event is generated if any one of them meets the trigger condition.
For more details, refer to Event Level Description.
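The "any value meets the condition" rule above can be sketched as follows (hypothetical values; not the platform's code):

```python
def triggers_event(values, condition):
    """Sketch: an event is generated if ANY value in the
    query result meets the trigger condition."""
    return any(condition(v) for v in values)

# Three series returned by one query; a single breach is enough.
print(triggers_event([42.0, 97.5, 60.1], lambda v: v >= 95))  # True
```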
## Continuous Trigger Judgment
If continuous trigger judgment is enabled, the system generates an event only after the trigger condition has been met several times in a row; the maximum is 10 times.
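A minimal sketch of this consecutive-trigger logic (the counter reset on any non-triggering detection is an assumption about the semantics):

```python
def should_alert(results, required=3):
    """Sketch: generate an event only after the trigger condition has
    been met `required` times in a row (the platform caps this at 10)."""
    streak = 0
    for triggered in results:
        streak = streak + 1 if triggered else 0  # any miss resets the run
        if streak >= required:
            return True
    return False

# The miss at the third detection resets the count; the final run of three fires.
print(should_alert([True, True, False, True, True, True]))  # True
```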
## Bulk Alert Protection
Enabled by default.
When the number of alerts generated in a single detection exceeds the preset threshold, the system automatically switches to a status summary strategy: instead of processing each alert object individually, it generates a small number of summary alerts based on event status and pushes them.
This ensures the timeliness of notifications while significantly reducing alert noise, avoiding the risk of timeout due to processing too many alerts.
Note
When this switch is enabled, the event details generated by subsequent monitor detections will not display historical records and associated events.
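The status-summary strategy described above can be sketched as follows (the threshold of 100 and the record shapes are hypothetical):

```python
from collections import Counter

# Hypothetical alerts produced by a single detection run.
alerts = [{"obj": f"host{i}", "status": "warning" if i % 3 == 0 else "critical"}
          for i in range(200)]

MAX_INDIVIDUAL = 100  # assumed preset threshold

if len(alerts) > MAX_INDIVIDUAL:
    # Summary strategy: one aggregate notification per event status,
    # instead of one notification per alert object.
    counts = Counter(a["status"] for a in alerts)
    notifications = [f"{status}: {n} objects affected"
                     for status, n in counts.items()]
else:
    notifications = [f"{a['status']}: {a['obj']}" for a in alerts]

print(notifications)  # two summary alerts instead of 200 individual ones
```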
## Alert Levels
- Alert levels Critical (red), Error (orange), Warning (yellow): based on the configured condition judgment operators.

    For more operator details, refer to Operator Description; for the `likeTrue` and `likeFalse` truth table details, refer to Truth Table Description.

- Alert level Normal (green): based on the configured number of detections, as explained below:

    - Each execution of a detection task counts as 1 detection. For example, if the detection frequency is 5 minutes, then 1 detection = 5 minutes;
    - You can customize the number of detections. For example, if the detection frequency is 5 minutes, then 3 detections = 15 minutes.

    | Level | Description |
    |---|---|
    | Normal | After the detection rule takes effect, if a critical, error, or warning abnormal event has been generated and the data detection result returns to normal within the configured number of detections, a recovery alert event is generated. ❗️ Recovery alert events are not subject to alert silence restrictions. If the recovery detection count is not set, the alert event will not recover and will remain in the Events > Unrecovered Events list. |
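The detection-count arithmetic above can be written out directly:

```python
def recovery_window_minutes(detection_frequency_min, detection_count):
    """One detection = one task execution, so the recovery window equals
    detection frequency x configured detection count."""
    return detection_frequency_min * detection_count

print(recovery_window_minutes(5, 1))  # 5  -> 1 detection = 5 minutes
print(recovery_window_minutes(5, 3))  # 15 -> 3 detections = 15 minutes
```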
## Recovery Conditions
After enabling recovery conditions, you can set recovery conditions and severity for the current explorer. When the query result contains multiple values, a recovery event is generated if any one of them meets the configured condition.
When the alert level changes from low to high, the corresponding level alert event is sent; when returning to normal, a normal recovery event is sent.
Note
- Recovery conditions are displayed only when all trigger conditions use >, >=, <, or <=;
- The recovery threshold for each level must be less than the corresponding trigger threshold (e.g., critical recovery threshold < critical trigger threshold).
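A minimal check of the threshold rule above for > / >= conditions (level names and values are illustrative):

```python
def validate_recovery_thresholds(trigger, recovery):
    """For > / >= trigger conditions, each level's recovery threshold
    must be strictly below its trigger threshold, per the rule above."""
    return {level: recovery[level] < trigger[level] for level in trigger}

trigger  = {"critical": 95, "error": 90, "warning": 85}  # e.g. memory usage >
recovery = {"critical": 90, "error": 85, "warning": 80}
print(validate_recovery_thresholds(trigger, recovery))  # all True -> valid
```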
### Recovery Alert Logic
After enabling "Recovery Conditions", the system uses a Fault ID as the unique identifier to manage the entire lifecycle of the alert (including creating Issues, etc.).
When hierarchical recovery is also enabled:
- The platform configures a separate set of recovery rules (i.e., recovery thresholds) for each alert level (e.g., `critical`, `warning`);
- The alert and recovery status of each level is calculated independently;
- This does not affect the alert lifecycle identified by the original Fault ID.
Therefore, when the monitor triggers an alert for the first time (i.e., starts a new alert lifecycle), the system generates two alert messages at the same time. They look similar because:

- The first alert comes from the overall detection (`check`) and represents the start of the entire fault lifecycle (based on the original rule);
- The second alert comes from the hierarchical detection (`critical`/`error`/`warning`/…) and indicates that the hierarchical recovery function has started; it presents the specific alert level and its subsequent recovery status (e.g., `critical_ok`).

The `df_monitor_checker_sub` field is the core basis for distinguishing the two types of alerts:

- `check`: represents the result of the overall detection;
- Other values (e.g., `critical`, `error`, `warning`): correspond to the results of the hierarchical detection rules.

Therefore, when an alert is triggered for the first time, two records appear with similar content but different sources and purposes.
| df_monitor_checker_sub | T+0 | T+1 | T+2 | T+3 |
|---|---|---|---|---|
| check | check | error | warning | ok |
| critical | critical | critical_ok | | |
| error | error | error_ok | | |
| warning | warning | warning_ok | | |
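Consumers of these events can separate the two record types by the `df_monitor_checker_sub` field, as in this sketch (the record shapes are hypothetical):

```python
# Hypothetical alert records from the first trigger of a monitor.
records = [
    {"df_monitor_checker_sub": "check",    "detail": "overall detection"},
    {"df_monitor_checker_sub": "critical", "detail": "hierarchical detection"},
]

# `check` marks the overall fault lifecycle; any other value marks a
# per-level (hierarchical) record.
overall      = [r for r in records if r["df_monitor_checker_sub"] == "check"]
hierarchical = [r for r in records if r["df_monitor_checker_sub"] != "check"]

print(len(overall), len(hierarchical))  # 1 1
```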
## Data Gap
For the data gap status, seven strategies can be configured.

- Link to the detection interval time range: judge the query result of the detection metric over the last few minutes and do not trigger events;
- Link to the detection interval time range: judge the query result of the detection metric over the last few minutes and treat the missing query result as 0; the result is then re-compared against the thresholds configured in the Trigger Conditions above to determine whether to trigger abnormal events.
- Custom-fill the detection interval value and trigger data gap events, critical events, error events, warning events, or recovery events. When this strategy is selected, it is recommended that the custom data gap time be >= the detection interval; if the configured time is <= the detection interval, both the data gap condition and an abnormal condition may be met at the same time, in which case only the data gap result is applied.
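The first two strategies can be sketched as follows (a minimal sketch, assuming a missing query result is represented as `None`):

```python
def evaluate(query_result, threshold, strategy="treat_as_zero"):
    """Sketch of two data gap strategies: either skip the event entirely,
    or treat the missing result as 0 and re-compare with the threshold."""
    if query_result is None:                  # data gap detected
        if strategy == "do_not_trigger":
            return None                       # no event at all
        query_result = 0                      # "treat query result as 0"
    return "abnormal" if query_result > threshold else "normal"

print(evaluate(None, 90))                     # gap -> 0, 0 <= 90 -> normal
print(evaluate(None, -1))                     # gap -> 0, 0 > -1  -> abnormal
print(evaluate(None, 90, "do_not_trigger"))   # None: no event
```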
## Information Generation
When this option is enabled, detection results that do not match the above trigger conditions will generate "information" events.
Note
If trigger conditions, data gap, and information generation are configured simultaneously, the triggering priority is as follows: data gap > trigger conditions > information event generation.
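The priority order in the note above can be sketched as a simple decision chain (hypothetical representation; a missing result stands in for a data gap):

```python
def classify(query_result, trigger_met):
    """Sketch of the documented priority: data gap > trigger
    conditions > information event generation."""
    if query_result is None:          # data gap is handled first
        return "data gap"
    if trigger_met(query_result):     # then the trigger conditions
        return "abnormal"
    return "information"              # otherwise an information event

threshold = lambda v: v > 90
print(classify(None, threshold))   # data gap
print(classify(95, threshold))     # abnormal
print(classify(10, threshold))     # information
```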
## Other Configurations
For more details, refer to Rule Configuration.
