
Threshold Detection


Current Document Location

This document is the second step in the detection rule configuration process. After completing the configuration, please return to the main document to continue with the third step: Event Notification.

Data Scope: Supports all data types (Metrics M, Logs L, Traces T, RUM data R, Objects O, etc.).

Detection Configuration

Detection Frequency

Set how frequently the detection task is executed.

  • Preset Options: 1 minute, 5 minutes, 10 minutes, 15 minutes, 30 minutes, 1 hour

  • Crontab Mode: Click "Switch to Crontab Mode" to configure a custom schedule. Supports configuring scheduled task execution based on seconds, minutes, hours, days, months, weeks, etc.
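As a rough illustration, the sketch below pairs a few common schedules with standard five-field cron expressions; the exact field layout accepted by Crontab mode (including the optional seconds field noted above) should be confirmed in the product UI.

```python
# Illustrative cron-style schedules (standard five-field layout:
# minute hour day-of-month month day-of-week). The product's Crontab
# mode may additionally accept a leading seconds field, per the note above.
custom_schedules = {
    "every 2 hours":           "0 */2 * * *",
    "daily at 09:30":          "30 9 * * *",
    "weekdays at 08:00":       "0 8 * * 1-5",
    "first day of the month":  "0 0 1 * *",
}
```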

Detection Interval

Set the data time range queried by each detection run (❗️The detection interval should be greater than or equal to the detection frequency and should match the actual data reporting cycle, to avoid missed detections or false alarms).

  • Preset Options:

| Detection Frequency | Detection Interval (Dropdown Options) |
| --- | --- |
| 30s | 1m / 5m / 15m / 30m / 1h / 3h |
| 1m | 1m / 5m / 15m / 30m / 1h / 3h |
| 5m | 5m / 15m / 30m / 1h / 3h |
| 15m | 15m / 30m / 1h / 3h / 6h |
| 30m | 30m / 1h / 3h / 6h |
| 1h | 1h / 3h / 6h / 12h / 24h |
| 6h | 6h / 12h / 24h |
| 12h | 12h / 24h |
| 24h | 24h |
  • Custom Format: Custom input for detection interval, e.g., 20m (last 20 minutes), 2h (last 2 hours), 1d (last 1 day).
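As a concrete illustration of this format, here is a minimal Python sketch (not the platform's parser) that converts interval strings into seconds and checks the interval-versus-frequency rule from the note above:

```python
# Minimal sketch: parse interval strings like "30s", "20m", "2h", "1d"
# into seconds. Illustrative only — not the platform's actual parser.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_interval(text: str) -> int:
    value, unit = int(text[:-1]), text[-1]
    return value * UNIT_SECONDS[unit]

# Per the note above, the detection interval should be >= the detection frequency.
frequency = parse_interval("5m")   # detection runs every 5 minutes
interval = parse_interval("20m")   # each run queries the last 20 minutes
assert interval >= frequency, "interval must cover at least one detection period"
```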

Detection Metric

Define the detection data source and aggregation method based on DQL.

Avoid selecting high-cardinality fields as detection dimensions: combined with overly lenient trigger conditions, such a configuration can easily lead to frequent alerts. The current query returns at most 100,000 records.

Configuration Example: When configuring the detection metric for a monitor, trace_id was selected as the grouping (BY) field.

The system will then prompt you:

The current grouping field trace_id is a high-cardinality risk field. Using such fields for aggregation will generate a massive number of groups, which can easily trigger alert explosions, leading to the monitor being automatically paused by the system. Please optimize the grouping settings.
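To illustrate the risk, the hedged sketch below (a hypothetical helper, not platform logic) estimates how many groups a candidate grouping field would produce from sampled rows before it is used as a detection dimension:

```python
# Hypothetical pre-check: estimate how many groups a BY field would produce.
# The sampled rows and the helper itself are illustrative, not platform logic.
def grouping_cardinality(rows: list[dict], field: str) -> int:
    return len({row.get(field) for row in rows})

rows = [{"trace_id": f"t-{i}", "service": "checkout"} for i in range(1000)]
print(grouping_cardinality(rows, "trace_id"))  # 1000 — one group per trace: risky
print(grouping_cardinality(rows, "service"))   # 1 — stable low-cardinality dimension
```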

Configuration Elements

| Configuration Item | Description |
| --- | --- |
| Workspace | Defaults to the current workspace; can be switched to other authorized workspaces |
| Data Source Type | Metrics, Logs, Traces, RUM data, etc. |
| Query Method | Simple Query, Expression Query, PromQL Query, Data Source Query |
| Filter Conditions | Filter detection objects via tags |
| Aggregation Method | avg, sum, max, min, count, etc. |
| Function Support | Click the fx button to select from the functions listed below |

After authorization, you can use detection metrics from other workspaces under the current account to create monitors; once the rule is created, cross-workspace alert configuration is in place. Note that when you select another workspace, the data type dropdown for detection metrics only displays the data types that the current workspace is authorized to use.

Functions available via the fx button:

  • Label Filtering: filter or group based on tag values
  • Conversion Functions: data type conversion or format conversion
  • Selection Functions: first, last, max, min
  • No-data Filling: previous value fill, linear fill, value fill

❗️When selecting the conversion functions derivative, difference, non_negative_derivative, or non_negative_difference, an interval must be appended, e.g., [::5m].

See Detailed Explanation of Query Methods and More Instructions on Function Usage.

Trigger Conditions

Configure trigger conditions and severity levels. When the query result contains multiple values, an event is generated if any value meets the trigger condition.

Supports configuring four threshold levels, Critical, Error, Warning, and Info, plus an OK recovery condition.

| Level | Configuration | Description |
| --- | --- | --- |
| Critical | When Result >= [value] | Highest-level alert, requires immediate action |
| Error | When Result >= [value] | High-level alert, requires priority handling |
| Warning | When Result >= [value] | Medium-level alert, requires attention |
| Info | When Result >= [value] | Low-level alert, worth noting |
| OK | No events generated for [N] consecutive detections | If the detection metric triggers a Critical, Error, Warning, or Info abnormal event and the next N consecutive detections are normal, an OK event is generated. Used to determine whether an abnormal event has returned to normal; configuring it is recommended |

For more details, refer to Event Level Description.
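As a minimal sketch of the trigger-condition semantics (the threshold values are illustrative assumptions, not defaults), ordered threshold evaluation might look like:

```python
# Illustrative threshold evaluation: check levels from most to least severe
# and return the first level whose "Result >= value" condition holds.
THRESHOLDS = [("critical", 95), ("error", 90), ("warning", 80), ("info", 70)]

def classify(result: float) -> str | None:
    for level, value in THRESHOLDS:
        if result >= value:
            return level
    return None  # no trigger condition met

print(classify(92.5))  # error
print(classify(60.0))  # None
```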

Advanced Options

Consecutive Trigger Judgment

When enabled, an event is generated only when the trigger condition is met in consecutive detections, avoiding false alarms caused by transient fluctuations (❗️the maximum configurable count is 10).
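The behavior can be pictured with a simple counter, as in this illustrative sketch (the breach test and the value of N are assumptions, not product code):

```python
# Illustrative consecutive-trigger judgment: fire only after the condition
# has been met in N consecutive detections (the platform caps N at 10).
def consecutive_trigger(results: list[float], threshold: float, n: int = 3) -> list[bool]:
    fired, streak = [], 0
    for value in results:
        streak = streak + 1 if value >= threshold else 0
        fired.append(streak >= n)
    return fired

# A single transient spike does not fire; three breaches in a row do.
print(consecutive_trigger([97, 50, 96, 97, 98], threshold=95))
# [False, False, False, False, True]
```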

Bulk Alert Protection

Enabled by default.

When the number of alerts generated by a single detection exceeds a preset threshold, the system automatically switches to a status-summary strategy: instead of processing each alerted object individually, it generates a small number of summary alerts grouped by event status and pushes those.

This ensures timely notification while significantly reducing alert noise and avoiding the timeout risk of processing too many alerts.

When this switch is enabled, the event details generated by this monitor after it detects anomalies will not display historical records or associated events.
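Conceptually, the status-summary strategy collapses per-object alerts into one message per event status, roughly as in this hedged sketch (the threshold and field names are assumptions):

```python
# Illustrative status-summary strategy: once a single detection produces more
# alerts than a preset threshold, collapse them into one summary per status.
from collections import Counter

def summarize(alerts: list[dict], limit: int = 100) -> list[str]:
    if len(alerts) <= limit:
        return [f"{a['object']}: {a['status']}" for a in alerts]  # normal path
    counts = Counter(a["status"] for a in alerts)
    return [f"{status}: {n} objects affected" for status, n in counts.items()]

alerts = [{"object": f"host-{i}", "status": "critical"} for i in range(500)]
print(summarize(alerts))  # ['critical: 500 objects affected']
```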

Recovery Conditions

Configure recovery conditions and severity levels. When the query result contains multiple values, a recovery event is generated if any value meets the recovery condition.

Set independent recovery thresholds for different levels to achieve downgraded recovery. For example: a Critical alert recovers when the value drops below 70, while a Warning alert recovers when it drops below 80.
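Reusing the numbers from that example, a minimal sketch of independent per-level recovery (a hypothetical helper, not platform code):

```python
# Illustrative hierarchical recovery: each level recovers independently once
# the value drops below its own recovery threshold (numbers from the example).
RECOVERY = {"critical": 70, "warning": 80}

def recovered_levels(value: float, active: list[str]) -> list[str]:
    return [level for level in active if value < RECOVERY[level]]

print(recovered_levels(75, ["critical", "warning"]))  # ['warning']; critical needs < 70
print(recovered_levels(65, ["critical", "warning"]))  # ['critical', 'warning']
```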

Recovery Alert Logic

After "Recovery Conditions" is enabled, the system uses the Fault ID as a unique identifier to manage the entire lifecycle of an alert (including operations like creating Issues).

When hierarchical recovery is also enabled:

• The platform configures a separate set of recovery rules (i.e., recovery thresholds) for each alert level (e.g., critical, warning)

• The alert status and recovery status of each level are calculated independently

• This does not affect the alert lifecycle identified by the original Fault ID

Therefore, when a monitor triggers an alert for the first time (i.e., starts a new alert lifecycle), the system simultaneously generates two alert messages. They appear similar because:

1. Source of the first alert: overall detection (check), representing the start of the entire fault lifecycle (based on the original rule);

2. Source of the second alert: hierarchical detection (critical/error/warning/…), indicating that the enabled hierarchical recovery function has started; it presents the specific alert level and its subsequent recovery status (e.g., critical_ok).

In these messages, the df_monitor_checker_sub field is the core basis for distinguishing the two types of alerts:

• check: represents the result of the overall detection;

• Other values (e.g., critical, error, warning): correspond to the results of the hierarchical detection rules.

Thus, when an alert is first triggered, two records appear with similar content but different sources and purposes.

| df_monitor_checker_sub | T+0 | T+1 | T+2 | T+3 |
| --- | --- | --- | --- | --- |
| check | check | error | warning | ok |
| critical | critical | critical_ok | | |
| error | | error | error_ok | |
| warning | | | warning | warning_ok |

Data Gap

The handling strategy when the detection metric query result is empty within the detection interval:

| Option | Description |
| --- | --- |
| Do Not Trigger Event (Default) | No alert is generated when data is missing; suitable for scenarios where data gaps are allowed |
| Treat Query Result as 0 | Treats empty data as a value of 0 for threshold judgment |
| Trigger Data Gap Event | Treats missing data as an anomaly and triggers a data gap event |
| Trigger Critical Event | Triggers a Critical-level event when data is missing |
| Trigger Error Event | Triggers an Error-level event when data is missing |
| Trigger Warning Event | Triggers a Warning-level event when data is missing |
| Trigger Info Event | Triggers an Info-level event when data is missing |
| Trigger Recovery Event | Triggers a recovery event when data is missing |

When trigger conditions, data gap handling, and info generation are configured simultaneously, the triggering priority is: Data Gap > Trigger Conditions > Info Event Generation.

That is: first check for a data gap, then check whether thresholds are triggered, and finally decide whether to generate an info event.
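That order can be sketched as follows (the function and threshold values are illustrative assumptions, not the platform's implementation):

```python
# Illustrative evaluation order: the data gap strategy is checked first, then
# threshold trigger conditions, and info generation only applies if nothing matched.
def evaluate(result: float | None) -> str:
    if result is None:                      # 1. data gap handling has top priority
        return "data_gap_event"
    for level, value in [("critical", 95), ("error", 90), ("warning", 80)]:
        if result >= value:                 # 2. then threshold trigger conditions
            return level
    return "info"                           # 3. finally, info generation (if enabled)

print(evaluate(None))   # data_gap_event
print(evaluate(92.0))   # error
print(evaluate(10.0))   # info
```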

Info Generation

When this option is enabled, the system writes all detection results that do not match the above trigger conditions as "Info" events.

Suitable for scenarios where recording normal status changes or low-priority information is needed.

Subsequent Configuration

After completing the detection configuration above, please continue to configure:

1. Event Notification: Define event title, content, notification members, data gap handling, and associated faults;

2. Alert Configuration: Select alert strategies, set notification targets and mute periods;

3. Association: Associate dashboards for quick jumps to view data;

4. Permissions: Set operation permissions to control who can edit/delete this monitor.
