
Threshold Detection

Threshold detection monitors anomalies in metrics, logs, infrastructure, Resource Catalog, events, APM, RUM, and other data. Rules set thresholds; when a threshold is exceeded, the system triggers an alert and notifies the relevant personnel. Multi-metric detection is also supported, with a different alert level configurable for each metric. For example, you can monitor whether host memory usage is abnormally high.

Detection Configuration

Detection Frequency

The execution frequency of the detection rule; default is 5 minutes.

In addition to the preset options, you can also enter a custom crontab expression to schedule detection tasks by second, minute, hour, day of month, month, and day of week.

  • When a custom crontab detection frequency is used, the available detection intervals are the last 1 minute, last 5 minutes, last 15 minutes, last 30 minutes, last 1 hour, last 6 hours, last 12 hours, and last 24 hours.
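As an illustration, a six-field crontab expression covers seconds through day of week; the example below would run a detection at second 0 of every 10th minute. The exact field order accepted by the configuration form should be confirmed against the product's crontab reference:

```
# second  minute  hour  day-of-month  month  day-of-week
0 */10 * * * *    # at second 0 of every 10th minute
```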

Detection Interval

The time range for querying the detection metrics. The available detection intervals vary depending on the detection frequency.

| Detection Frequency | Detection Interval (Dropdown Options) |
| ------------------- | ------------------------------------- |
| 30s                 | 1m / 5m / 15m / 30m / 1h / 3h         |
| 1m                  | 1m / 5m / 15m / 30m / 1h / 3h         |
| 5m                  | 5m / 15m / 30m / 1h / 3h              |
| 15m                 | 15m / 30m / 1h / 3h / 6h              |
| 30m                 | 30m / 1h / 3h / 6h                    |
| 1h                  | 1h / 3h / 6h / 12h / 24h              |
| 6h                  | 6h / 12h / 24h                        |
| 12h                 | 12h / 24h                             |
| 24h                 | 24h                                   |
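The table above can be captured as a simple lookup for client-side validation. A minimal sketch, with the mapping transcribed from the table (this is not an official API):

```python
# Allowed detection intervals (dropdown options) per detection frequency,
# transcribed from the table above.
ALLOWED_INTERVALS = {
    "30s": ["1m", "5m", "15m", "30m", "1h", "3h"],
    "1m":  ["1m", "5m", "15m", "30m", "1h", "3h"],
    "5m":  ["5m", "15m", "30m", "1h", "3h"],
    "15m": ["15m", "30m", "1h", "3h", "6h"],
    "30m": ["30m", "1h", "3h", "6h"],
    "1h":  ["1h", "3h", "6h", "12h", "24h"],
    "6h":  ["6h", "12h", "24h"],
    "12h": ["12h", "24h"],
    "24h": ["24h"],
}

def interval_is_valid(frequency: str, interval: str) -> bool:
    """Return True if the interval is selectable for the given frequency."""
    return interval in ALLOWED_INTERVALS.get(frequency, [])
```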

Detection Metrics

  1. Data Type: The type of data currently being detected, including metrics, logs, infrastructure, Resource Catalog, events, APM, RUM, and network data.

  2. Aggregation Algorithms: Provides various aggregation algorithms, including:

    • Avg by (average)
    • Min by (minimum)
    • Max by (maximum)
    • Sum by (sum)
    • Last (last value)
    • First by (first value)
    • Count by (data points)
    • Count_distinct by (unique data points)
    • p50 (median value)
    • p75 (75th percentile)
    • p90 (90th percentile)
    • p99 (99th percentile)
    • ......
    Note

    When selecting the transformation functions derivative, difference, non_negative_derivative, or non_negative_difference, an interval must be added. For example: [::5m].

  3. Detection Dimensions: You can configure string-type (keyword) fields in the data as detection dimensions, with support for up to three fields. By combining multiple detection dimension fields, specific detection objects can be determined. The system will determine whether the statistical metrics of these objects meet the trigger conditions, and if so, generate events.

    • For example, with detection dimensions host and host_ip, a detection object can be represented as {host: host1, host_ip: 127.0.0.1}. When the data type is logs, the default detection dimensions are status, host, service, source, and filename.
  4. Filter Conditions: Filter the detection data based on metric tags to limit the detection scope. Supports adding one or more tag filter conditions, with fuzzy matching and non-matching filters supported for non-metric data.

  5. Alias: Customize the detection metric name.

  6. Query Methods: Supports simple queries, expression queries, PromQL queries, and data source queries.
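Conceptually, the detection dimensions act like a GROUP BY: each distinct combination of dimension values becomes one detection object whose aggregated value is compared against the trigger conditions. A minimal sketch of this grouping (the data shapes and field names are illustrative, not the platform's internal format):

```python
from collections import defaultdict

# Illustrative data points; in practice these come from the configured query.
points = [
    {"host": "host1", "host_ip": "127.0.0.1", "mem_used_percent": 91.0},
    {"host": "host1", "host_ip": "127.0.0.1", "mem_used_percent": 93.5},
    {"host": "host2", "host_ip": "10.0.0.2",  "mem_used_percent": 40.2},
]

def group_by_dimensions(points, dimensions, value_field):
    """Group points into detection objects keyed by the dimension fields."""
    groups = defaultdict(list)
    for p in points:
        key = tuple((d, p[d]) for d in dimensions)
        groups[key].append(p[value_field])
    return dict(groups)

# Each key, e.g. (("host", "host1"), ("host_ip", "127.0.0.1")),
# identifies one detection object.
groups = group_by_dimensions(points, ["host", "host_ip"], "mem_used_percent")
```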

Cross-Workspace Query Metrics

After authorization, you can select detection metrics from other workspaces under the current account. Once the monitor rule is created, alerts can be configured across workspaces.

Note

After selecting another workspace, the detection metric dropdown options will only display data types that have been authorized in the current workspace.

Trigger Conditions

Set trigger conditions for alert levels: you can configure any one of the critical, error, warning, or normal trigger conditions.

Configure the trigger condition and its severity. When the query result contains multiple values, an event is generated if any one of them meets the trigger condition.

For more details, refer to Event Level Description.
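The "any value meets the condition" rule can be sketched in one line; a minimal illustration (the condition and values are hypothetical):

```python
def breaches(values, condition):
    """Generate an event if ANY value in the query result meets the
    trigger condition (illustrative sketch, not the platform's code)."""
    return any(condition(v) for v in values)

# Example: warn when memory usage exceeds 80%.
breaches([62.1, 85.4, 70.0], lambda v: v > 80.0)  # one value breaches
```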

Continuous Trigger Judgment

If continuous trigger judgment is enabled, you can configure the system to generate an event after the trigger condition is met multiple times consecutively. The maximum limit is 10 times.
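Continuous trigger judgment can be sketched as a consecutive-breach counter (a minimal illustration, not the platform's implementation):

```python
def make_consecutive_trigger(required: int):
    """Fire an event only after the condition has been met `required`
    times in a row (the platform caps this at 10)."""
    assert 1 <= required <= 10
    streak = 0

    def observe(condition_met: bool) -> bool:
        nonlocal streak
        streak = streak + 1 if condition_met else 0
        return streak >= required

    return observe

trigger = make_consecutive_trigger(3)
# Only the final observation completes 3 consecutive breaches;
# the False in the middle resets the streak.
results = [trigger(met) for met in [True, True, False, True, True, True]]
```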

Bulk Alert Protection

Enabled by default.

When the number of alerts generated in a single detection exceeds the preset threshold, the system automatically switches to a status summary strategy: instead of processing each alert object individually, it generates a small number of summary alerts based on event status and pushes them.

This ensures the timeliness of notifications while significantly reducing alert noise, avoiding the risk of timeout due to processing too many alerts.

Note

When this switch is enabled, the event details generated by subsequent monitor detections will not display historical records and associated events.
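The status-summary strategy can be sketched as: if a single detection pass produces more alerts than the threshold, collapse them into one summary alert per event status. The statuses, record shape, and threshold below are illustrative assumptions:

```python
from collections import Counter

def summarize_alerts(alerts, threshold=100):
    """If the alert count exceeds `threshold`, return one summary per
    status instead of individual alerts; otherwise return the alerts
    unchanged. `alerts` is a list of dicts with a "status" key
    (illustrative shape, not the platform's event format)."""
    if len(alerts) <= threshold:
        return alerts
    counts = Counter(a["status"] for a in alerts)
    return [
        {"status": status, "summary": f"{n} objects triggered {status} alerts"}
        for status, n in counts.items()
    ]

# 210 individual alerts collapse into 2 status summaries.
alerts = [{"status": "critical"}] * 150 + [{"status": "warning"}] * 60
summaries = summarize_alerts(alerts, threshold=100)
```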

Alert Levels

  1. Alert levels Critical (red), Error (orange), and Warning (yellow) are judged based on the configured condition operators.

    For more operator details, refer to Operator Description;

    For the likeTrue and likeFalse truth table details, refer to Truth Table Description.

  2. Alert Level Normal (green): Based on the configured number of detections, as explained below:

    • Each execution of a detection task counts as 1 detection, e.g., detection frequency = 5 minutes, then 1 detection = 5 minutes;
    • You can customize the number of detections, e.g., detection frequency = 5 minutes, then 3 detections = 15 minutes.
    Level  | Description
    ------ | -----------
    Normal | After the detection rule takes effect, if a critical, error, or warning abnormal event has been generated and the data detection result returns to normal within the configured detection count, a recovery alert event is generated.

    ❗️ Recovery alert events are not subject to alert silence restrictions. If the recovery detection count is not set, the alert event will not recover and will remain in the Events > Unrecovered Events list.
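The arithmetic in the examples above (detection frequency times detection count) can be written out as a trivial helper; a minimal sketch:

```python
def recovery_window_minutes(frequency_minutes: int, detection_count: int) -> int:
    """Each execution of the detection task counts as one detection, so
    the recovery window is frequency x count (e.g. 5m x 3 = 15 minutes)."""
    return frequency_minutes * detection_count

recovery_window_minutes(5, 1)  # frequency 5m, 1 detection  -> 5 minutes
recovery_window_minutes(5, 3)  # frequency 5m, 3 detections -> 15 minutes
```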

Recovery Conditions

After recovery conditions are enabled, you can set the recovery conditions and their severity for the current monitor. When the query result contains multiple values, a recovery event is generated if any one of them meets the recovery condition.

When the alert level escalates, an alert event of the corresponding level is sent; when the result returns to normal, a normal recovery event is sent.

Note
  • Recovery conditions are only displayed when all trigger conditions are >, >=, <, <=;
  • The corresponding level recovery threshold must be less than the trigger threshold (e.g., critical recovery threshold < critical trigger threshold).
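Because each recovery threshold must sit below its trigger threshold, the pair behaves like a hysteresis band: once alerting, the value must drop below the recovery threshold before the level recovers. A minimal sketch of one level under this assumption (thresholds and event names are illustrative):

```python
def make_level(trigger: float, recover: float):
    """One alert level with hysteresis: alert at value >= trigger,
    recover only at value < recover. The recovery threshold must be
    less than the trigger threshold (illustrative sketch)."""
    if recover >= trigger:
        raise ValueError("recovery threshold must be less than trigger threshold")
    alerting = False

    def observe(value: float) -> str:
        nonlocal alerting
        if not alerting and value >= trigger:
            alerting = True
            return "critical"
        if alerting and value < recover:
            alerting = False
            return "critical_ok"
        return "no_change"

    return observe

critical = make_level(trigger=90.0, recover=80.0)
# 95 trips the alert; 85 stays inside the band; 75 recovers.
events = [critical(v) for v in [95.0, 85.0, 75.0]]
```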

Recovery Alert Logic

After enabling "Recovery Conditions", the system uses the Fault ID as a unique identifier to manage the entire lifecycle of the alert (including creating issues, etc.).

When hierarchical recovery is also enabled:

  • The platform configures a separate set of recovery rules (i.e., recovery thresholds) for each alert level (e.g., critical, warning)

  • The alert and recovery status of each level is calculated independently

  • It does not affect the original Fault ID identified alert lifecycle

Therefore, when the monitor triggers an alert for the first time (i.e., starting a new alert lifecycle), the system simultaneously generates two alert messages. They appear similar because:

  1. The first alert source: overall detection (check), representing the start of the entire fault lifecycle (based on the original rule);

  2. The second alert source: hierarchical detection (critical/error/warning/…), indicating that the enabled hierarchical recovery function has started, used to present the specific alert level and its subsequent recovery status (e.g., critical_ok).

In the above, the df_monitor_checker_sub field is the core basis for distinguishing the two types of alerts:

  • check: represents the result of the overall detection;

  • Other values (e.g., critical, error, warning, etc.): correspond to the results of the hierarchical detection rules.

Therefore, when an alert is triggered for the first time, two records will appear, with similar content but different sources and purposes.

| df_monitor_checker_sub | T+0      | T+1         | T+2      | T+3        |
| ---------------------- | -------- | ----------- | -------- | ---------- |
| check                  | check    | error       | warning  | ok         |
| critical               | critical | critical_ok |          |            |
| error                  |          | error       | error_ok |            |
| warning                |          |             | warning  | warning_ok |

Data Gap

For the data gap status, seven handling strategies can be configured.

  1. Link to the detection interval time range, judge the query result of the detection metric over the last few minutes, and do not trigger any event;

  2. Link to the detection interval time range, judge the query result of the detection metric over the last few minutes, and treat a missing result as 0; the result is then re-compared against the thresholds configured in Trigger Conditions above to determine whether to trigger an abnormal event;

  3. Fill the data gap with a custom value for the detection interval, and trigger data gap events, critical events, error events, warning events, or recovery events. With this strategy, it is recommended that the custom data gap time be >= the detection interval; if the configured time is shorter than the detection interval, a detection may satisfy both the data gap and abnormal conditions at once, in which case only the data gap result is applied.
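The treat-as-zero strategy (option 2 above) can be sketched as: when the query returns nothing for the gap window, substitute 0 and re-run the normal threshold comparison. A minimal illustration (the trigger predicate is a hypothetical example):

```python
def evaluate_with_gap_as_zero(query_result, trigger):
    """If the detection query returned no data, treat the value as 0 and
    re-compare it against the trigger condition (illustrative sketch of
    strategy 2 above). `trigger` is a predicate over the value."""
    value = 0.0 if query_result is None else query_result
    return trigger(value)

# Hypothetical trigger: alert when free disk space drops below 10 GB.
low_disk = lambda v: v < 10.0
evaluate_with_gap_as_zero(None, low_disk)   # gap -> value 0 -> condition met
evaluate_with_gap_as_zero(55.0, low_disk)   # data present and healthy
```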

Information Generation

When this option is enabled, detection results that do not match the above trigger conditions will generate "information" events.

Note

If trigger conditions, data gap, and information generation are configured simultaneously, the triggering priority is as follows: data gap > trigger conditions > information event generation.
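The priority order in the note can be sketched as a single evaluation pass (the status names are illustrative):

```python
def classify_detection(is_gap: bool, trigger_met: bool) -> str:
    """Apply the documented priority:
    data gap > trigger conditions > information event generation."""
    if is_gap:
        return "data_gap"
    if trigger_met:
        return "trigger"
    return "information"
```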

Other Configurations

For more details, refer to Rule Configuration.
