
Threshold Detection


Threshold detection monitors anomalies in Metrics, Logs, Infrastructure, Resource Catalog, Events, APM, RUM, and other data. Rules are configured with thresholds; when a threshold is exceeded, the system triggers an alert and notifies the relevant personnel. Multi-metric detection is also supported, allowing a different alert level to be configured for each metric. A typical example is monitoring whether a host's memory usage is abnormally high.

Detection Configuration

Detection Frequency

The detection frequency is how often the detection rule runs; the default is 5 minutes.

In addition to the preset options, you can enter a custom crontab expression to schedule detection by second, minute, hour, day, month, or day of week.

  • When a custom crontab detection frequency is used, the available detection intervals are the last 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 6 hours, 12 hours, and 24 hours.
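For reference, the sketch below lists a few crontab expressions of the kind that could be entered as a custom detection frequency. These assume standard five-field cron syntax (minute, hour, day of month, month, day of week); the exact field layout accepted by the input is defined by the product.

```python
# Illustrative crontab expressions (assumption: standard five-field cron syntax).
# The keys describe the intended schedule; the values are what would be entered.
custom_frequencies = {
    "every 10 minutes":           "*/10 * * * *",
    "at minute 30 of every hour": "30 * * * *",
    "every day at 02:00":         "0 2 * * *",
    "every Monday at 09:00":      "0 9 * * 1",
}
```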

Detection Interval

This refers to the time range for querying the detection metric. The selectable detection intervals vary depending on the detection frequency.

Detection Frequency | Detection Interval (Dropdown Options)
------------------- | -------------------------------------
30s                 | 1m / 5m / 15m / 30m / 1h / 3h
1m                  | 1m / 5m / 15m / 30m / 1h / 3h
5m                  | 5m / 15m / 30m / 1h / 3h
15m                 | 15m / 30m / 1h / 3h / 6h
30m                 | 30m / 1h / 3h / 6h
1h                  | 1h / 3h / 6h / 12h / 24h
6h                  | 6h / 12h / 24h
12h                 | 12h / 24h
24h                 | 24h

Detection Metric

  1. Data Type: The type of data currently being detected, including Metrics, Logs, Infrastructure, Resource Catalog, Events, APM, RUM, and Network data types.

  2. Aggregation Algorithm: Provides various aggregation algorithms, including:

    • Avg by (Average value)
    • Min by (Minimum value)
    • Max by (Maximum value)
    • Sum by (Summation)
    • Last (Last value)
    • First by (First value)
    • Count by (Number of data points)
    • Count_distinct by (Number of distinct data points)
    • p50 (Median value)
    • p75 (Value at the 75th percentile)
    • p90 (Value at the 90th percentile)
    • p99 (Value at the 99th percentile)
    • ......
    Note

    When one of the transformation functions derivative, difference, non_negative_derivative, or non_negative_difference is selected, a time interval must be appended, for example: [::5m].

  3. Detection Dimension: String-type (keyword) fields in the data can be configured as detection dimensions, with up to three fields supported. The combination of the detection dimension fields identifies a specific detection object; the system determines whether the statistical metric of that object meets the trigger condition and, if so, generates an event (see the sketch after this list).

    • For example, with detection dimensions host and host_ip, a detection object can be represented as {host: host1, host_ip: 127.0.0.1}. When the detected data is Logs, the fields status, host, service, source, and filename are used as detection dimensions by default.
  4. Filter Condition: Filters the detection data based on the metric's tags to limit the detection scope. Supports adding one or multiple tag filter conditions. Non-metric data supports fuzzy matching and fuzzy non-matching filters.

  5. Alias: Custom name for the detection metric.

  6. Query Method: Supports Simple Query, Expression Query, PromQL Query, and Data Source Query.
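The sketch below illustrates how detection dimensions and an aggregation combine: rows are grouped by the chosen dimension fields into detection objects, each object's values are aggregated, and the result is compared with a trigger threshold. The field names, sample rows, and threshold are illustrative assumptions, not product code.

```python
# Minimal sketch (not product code): detection dimensions identify a detection
# object; the object's aggregated value is compared against a trigger threshold.
from collections import defaultdict

rows = [
    {"host": "host1", "host_ip": "127.0.0.1", "mem_used_percent": 92.0},
    {"host": "host1", "host_ip": "127.0.0.1", "mem_used_percent": 95.0},
    {"host": "host2", "host_ip": "10.0.0.2",  "mem_used_percent": 41.0},
]

dimensions = ("host", "host_ip")     # up to three keyword fields may be chosen
CRITICAL_THRESHOLD = 90.0            # trigger condition: Avg by value > 90

groups = defaultdict(list)
for row in rows:
    detection_object = tuple((d, row[d]) for d in dimensions)
    groups[detection_object].append(row["mem_used_percent"])

for obj, values in groups.items():
    avg = sum(values) / len(values)  # "Avg by" aggregation
    if avg > CRITICAL_THRESHOLD:
        print(f"Critical event for {dict(obj)}: avg={avg:.1f}")
```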

Query Metrics Across Workspaces

After authorization, you can select detection metrics from other workspaces under the current account. Once the monitor rule is created, alerts can be configured across workspaces.

Note

After selecting another workspace, the detection metric dropdown will only display data types that have been authorized for the current workspace.

Trigger Condition

Set trigger conditions for the alert levels: any of Critical, Error, Warning, and OK can be configured.

Configure the trigger condition and severity. When the query result contains multiple values, an event is generated if any value meets the trigger condition.

For more details, refer to Event Level Description.

Consecutive Trigger Judgment

If Consecutive Trigger Judgment is enabled, an event is generated only after the trigger condition has been met a configured number of consecutive times (up to 10).
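As a rough sketch (not the platform's implementation), consecutive trigger judgment can be thought of as a counter that only produces an event once the condition has held for n detections in a row:

```python
# Illustrative sketch of consecutive-trigger judgment: an abnormal event is
# generated only after the trigger condition has held for n detections in a row.
def consecutive_trigger(results, condition, n):
    """results: per-detection query values in time order; n: required streak (<= 10)."""
    streak = 0
    for value in results:
        streak = streak + 1 if condition(value) else 0
        if streak >= n:
            return True   # generate the abnormal event
    return False

# Example: threshold 90, require 3 consecutive breaches.
print(consecutive_trigger([91, 95, 88, 92, 93, 97], lambda v: v > 90, 3))  # True
```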

Bulk Alert Protection

Enabled by default.

When the number of alerts generated in a single detection exceeds a preset threshold, the system automatically switches to a status summary strategy: instead of processing each alert object individually, it generates a small number of summary alerts based on event status and pushes them.

This ensures the timeliness of notifications while significantly reducing alert noise and avoiding timeout risks caused by processing too many alerts.

Note

When this switch is enabled, the Event Details of the events generated after the monitor detects an anomaly do not display historical records or associated events.
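A simplified sketch of the summary strategy described above, with an assumed limit of 50 alerts per detection run (the actual preset threshold is defined by the platform):

```python
# Sketch of the bulk-protection idea: if a single detection run produces more
# alerts than a preset threshold, collapse them into one summary alert per event
# status instead of pushing every alert individually. The limit is an assumption.
from collections import Counter

def protect(alerts, max_individual=50):
    if len(alerts) <= max_individual:
        return alerts                                  # push each alert as-is
    by_status = Counter(a["status"] for a in alerts)
    return [{"summary": True, "status": s, "count": c} for s, c in by_status.items()]

alerts = [{"status": "critical"}] * 40 + [{"status": "warning"}] * 30
print(protect(alerts))  # two summary alerts: critical x 40, warning x 30
```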

Alert Level

  1. Alert Levels Critical (red), Error (orange), and Warning (yellow):

    For more details on operators, refer to Operator Description;

    For details on the truth table for likeTrue and likeFalse, refer to Truth Table Description.

  2. Alert Level OK (green): Based on the configured number of detections, explained as follows:

    • Each execution of a detection task counts as 1 detection. For example, if Detection Frequency = 5 minutes, then 1 detection = 5 minutes;
    • The number of detections can be customized. For example, if Detection Frequency = 5 minutes, then 3 detections = 15 minutes.
    Level | Description
    ----- | -----------
    OK    | After the detection rule takes effect and a Critical, Error, or Warning abnormal event is generated, if the data detection result returns to normal within the configured number of detections, a recovery alert event is generated.

    ❗️ Recovery alert events are not restricted by Alert Silence. If the number of detections for recovery alert events is not set, the alert event will not recover and will remain in the Events > Unrecovered Events List.
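The relationship between detection frequency and the number of detections is simple multiplication; the snippet below just restates the examples above:

```python
# Each detection task run counts as one detection, so the time covered by the
# recovery check is the detection frequency multiplied by the configured count.
detection_frequency_min = 5
print(1 * detection_frequency_min)  # 1 detection  -> 5 minutes
print(3 * detection_frequency_min)  # 3 detections -> 15 minutes
```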

Recovery Conditions

After enabling Configure Recovery Conditions, you can set recovery conditions and their severity for the current monitor. When the query result contains multiple values, a recovery event is generated if any value meets the recovery condition.

When the alert level changes from low to high, an alert event of the corresponding level is sent; when it recovers to normal, a normal recovery event is sent.

Note
  • Recovery conditions are only displayed here when all selected trigger conditions are >, >=, <, <=;
  • The recovery threshold for the corresponding level must be less than the trigger threshold (e.g., Critical recovery threshold < Critical trigger threshold).
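A minimal sketch of how a trigger threshold and a lower recovery threshold interact for one level, assuming a ">" trigger condition and a "<" recovery comparison; the threshold values are illustrative:

```python
# Sketch of hysteresis between a trigger threshold and a lower recovery threshold
# for one level. Assumes "> trigger" raises the alert and "< recovery" clears it.
CRITICAL_TRIGGER = 95.0    # trigger condition: value > 95 -> critical event
CRITICAL_RECOVERY = 90.0   # recovery condition: value < 90 -> critical_ok

def evaluate(value, in_critical):
    if not in_critical and value > CRITICAL_TRIGGER:
        return "critical", True
    if in_critical and value < CRITICAL_RECOVERY:
        return "critical_ok", False
    return None, in_critical

state = False
for v in [92, 96, 93, 89]:
    event, state = evaluate(v, state)
    print(v, event)   # 92 None, 96 critical, 93 None (still critical), 89 critical_ok
```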

Recovery Alert Logic

After enabling "Recovery Conditions", the system uses a Fault ID as a unique identifier to manage the entire lifecycle of an alert (including operations like creating an Issue).

When hierarchical recovery is also enabled:

  • The platform configures a separate set of recovery rules (i.e., recovery thresholds) for each alert level (e.g., critical, warning)

  • The alert status and recovery status for each level are calculated independently

  • The alert lifecycle identified by the original Fault ID is not affected

Therefore, when a monitor triggers an alert for the first time (i.e., starting a new alert lifecycle), the system generates two alert messages at the same time. The two look similar but come from different sources:

  1. The first alert source: Overall detection (check), representing the beginning of the entire fault lifecycle (based on the original rule);

  2. The second alert source: Hierarchical detection (critical/error/warning/…), indicating that the enabled hierarchical recovery function has started, used to present the specific alert level and its subsequent recovery status (e.g., critical_ok).

In the above, the df_monitor_checker_sub field is the key to distinguishing the two types of alerts:

  • check: Represents the result of the overall detection;

  • Other values (e.g., critical, error, warning, etc.): Correspond to the results of the hierarchical detection rules.

Therefore, when an alert is triggered for the first time, two records appear with similar content but different sources and purposes.

df_monitor_checker_sub | T+0      | T+1         | T+2     | T+3
---------------------- | -------- | ----------- | ------- | ---
check                  | check    | error       | warning | ok
critical               | critical | critical_ok |         |
error                  | error    | error_ok    |         |
warning                | warning  | warning_ok  |         |
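For illustration only, the two event streams can be told apart by checking df_monitor_checker_sub; the event payloads below are made up and are not the platform's actual schema:

```python
# Illustration only: separating the two event streams by df_monitor_checker_sub.
# "check" carries the overall detection lifecycle; other values (critical, error,
# warning, ...) carry the per-level hierarchical recovery status.
events = [
    {"df_monitor_checker_sub": "check",    "status": "critical"},
    {"df_monitor_checker_sub": "critical", "status": "critical"},
    {"df_monitor_checker_sub": "critical", "status": "critical_ok"},
    {"df_monitor_checker_sub": "check",    "status": "ok"},
]

overall   = [e for e in events if e["df_monitor_checker_sub"] == "check"]
per_level = [e for e in events if e["df_monitor_checker_sub"] != "check"]
print(len(overall), len(per_level))  # 2 2
```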

Data Gap

Seven strategies can be configured for data gap status.

  1. Linked to the detection interval time range: the query result of the detection metric over the most recent interval is evaluated, and no event is triggered when data is missing;

  2. Linked to the detection interval time range: the query result of the detection metric over the most recent interval is evaluated, and the missing result is treated as 0; this value is then re-compared against the thresholds configured in the Trigger Condition above to determine whether to trigger an abnormal event;

  3. Custom data gap interval: after no data is received for the configured period, trigger a data gap event, a critical event, an error event, a warning event, or a recovery event. With this strategy, the recommended custom data gap time is >= the detection interval time. If the configured time is <= the detection interval time, a detection may satisfy both the data gap condition and an abnormal condition at the same time; in that case, only the data gap result is applied.
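A condensed sketch of the three strategy families, assuming the query returns no value (None) when no data points fall inside the detection interval; the threshold and strategy names are illustrative, not the product's configuration keys:

```python
# Condensed sketch of the three data-gap strategy families, assuming the query
# returns None when no data points fall inside the detection interval.
def handle_gap(query_result, strategy, threshold=90.0):
    if query_result is not None:
        return "critical" if query_result > threshold else None
    if strategy == "no_event":        # strategy 1: do not trigger an event
        return None
    if strategy == "treat_as_zero":   # strategy 2: compare 0 against the trigger condition
        return "critical" if 0 > threshold else None
    if strategy == "gap_event":       # strategy 3: trigger the configured data-gap event
        return "data_gap"

print(handle_gap(None, "no_event"))       # None
print(handle_gap(None, "treat_as_zero"))  # None (0 does not exceed 90)
print(handle_gap(None, "gap_event"))      # data_gap
```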

Information Generation

After this option is enabled, detection results that do not match any of the above trigger conditions generate and record an "Info" event.

Note

When trigger conditions, data gap, and information generation are configured simultaneously, the triggering priority is judged as follows: Data Gap > Trigger Condition > Info Event Generation.
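A tiny sketch of the stated priority order, with an assumed ">" trigger condition and threshold:

```python
# Sketch of the stated priority: data gap is checked first, then the trigger
# conditions; an Info event is generated only when neither matches.
def classify(query_result, threshold=90.0):
    if query_result is None:       # 1. data gap
        return "data_gap"
    if query_result > threshold:   # 2. trigger condition
        return "critical"
    return "info"                  # 3. information generation

print(classify(None), classify(95.0), classify(42.0))  # data_gap critical info
```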

Other Configurations

For more details, refer to Rule Configuration.
