APM Metrics Anomaly Detection¶
Current Document Location
This document is the second step in the detection rule configuration process. After configuration, please return to the main document to continue with the third step: Event Notification.
Data Scope: Traces (T), used to monitor key APM metric data within the workspace. The system counts the number of traces that meet the conditions within a specified time period and triggers an anomaly event when that count exceeds a custom threshold.
Detection Configuration¶
Detection Frequency¶
Set the time cycle for executing detection.
- Preset Options: 1 minute, 5 minutes, 10 minutes, 15 minutes, 30 minutes, 1 hour
- Crontab Mode: Click "Switch to Crontab Mode" to configure a custom cycle, supporting scheduled task execution based on seconds, minutes, hours, days, months, weeks, etc.
Detection Interval¶
Set the data time range queried for each detection (❗️The detection interval should be greater than or equal to the detection frequency and should match the actual data reporting cycle to avoid missed detection or false alarms).
| Detection Frequency | Detection Interval (Dropdown Options) |
|---|---|
| 30s | 1m/5m/15m/30m/1h/3h |
| 1m | 1m/5m/15m/30m/1h/3h |
| 5m | 5m/15m/30m/1h/3h |
| 15m | 15m/30m/1h/3h/6h |
| 30m | 30m/1h/3h/6h |
| 1h | 1h/3h/6h/12h/24h |
| 6h | 6h/12h/24h |
| 12h | 12h/24h |
| 24h | 24h |
- Custom Format: Custom input for detection interval, e.g., 20m (last 20 minutes), 2h (last 2 hours), 1d (last 1 day).
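The rule that the detection interval should be greater than or equal to the detection frequency can be checked with a small helper. The following is a minimal sketch, not the product's own parser: the function names `parse_duration` and `interval_covers_frequency` are hypothetical, and the set of unit suffixes (`s`, `m`, `h`, `d`) is assumed from the examples in this section.

```python
import re

# Unit suffixes assumed from the examples in this section (30s, 20m, 2h, 1d).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> int:
    """Convert a string such as '20m' or '2h' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def interval_covers_frequency(frequency: str, interval: str) -> bool:
    """Return True when the detection interval is >= the detection frequency."""
    return parse_duration(interval) >= parse_duration(frequency)


print(interval_covers_frequency("5m", "20m"))  # True
print(interval_covers_frequency("1h", "30m"))  # False
```

An interval shorter than the frequency leaves part of each cycle unqueried, which is one way detections can be missed.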
Detection Metrics¶
Set the metrics for detection data, supporting two detection modes:
- Service Metrics
- Trace Statistics
Note
Please avoid selecting high-cardinality fields as detection dimensions. Combined with overly lenient trigger conditions, they can produce frequent alerts. A query currently returns a maximum of 100,000 records.
Service Metrics¶
Monitor the APM services within the current workspace.
| Configuration Item | Description |
|---|---|
| Service | Select APM services within the current workspace, supporting "All" or specified services |
| Metric | Specific detection metrics, including: Request Count, Error Request Count, Request Error Rate, Average Requests Per Second, Average Response Time, P50 Response Time, P75 Response Time, P90 Response Time, P99 Response Time |
| Filter Conditions | Filter detection data based on metric tags to limit the detection scope. Supports adding one or multiple tag filters, including fuzzy match and fuzzy non-match conditions |
| Detection Dimensions | Any string-type (keyword) field in the configured data can be selected as a detection dimension, currently supporting up to three fields. The combination of multiple detection dimension fields can define a specific detection object (e.g., {service: svc1, host: host1}) |
| Additional Information | Select field information to be additionally displayed for enriching event content |
Trace Statistics¶
Count the number of traces (Spans) meeting the conditions within a specified time period, triggering an anomaly event when exceeding a custom threshold. Can be used for notifications of service trace anomalies and errors.
| Configuration Item | Description |
|---|---|
| Source | Select the source (service) of trace data to be counted, supports keyword filtering |
| Filter Conditions | Filter trace span through tags to limit the data scope for detection. Supports adding one or multiple tag filter conditions |
| Aggregation Algorithm | Default selection is "*", corresponding to the count aggregation function (counting the number of Spans). If another field is selected, the aggregation function automatically changes to count distinct (counting distinct data points where the keyword appears, i.e., deduplicated count) |
| Detection Dimensions | Any string-type (keyword) field in the configured data can be selected as a detection dimension, currently supporting up to three fields. The combination of multiple detection dimension fields can define a specific detection object |
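The difference between the two aggregation modes, and the way detection dimensions split the result into detection objects, can be illustrated in a few lines. This is a sketch under stated assumptions: `aggregate_spans` and the sample span fields (`service`, `host`, `resource`) are hypothetical and do not represent the product's query engine.

```python
from collections import defaultdict


def aggregate_spans(spans, dimensions, field="*"):
    """Group spans by the dimension fields and aggregate each group.

    field="*" counts spans; any other field counts its distinct values.
    """
    groups = defaultdict(list)
    for span in spans:
        # Each distinct combination of dimension values is one detection object.
        key = tuple(span.get(d) for d in dimensions)
        groups[key].append(span)

    result = {}
    for key, members in groups.items():
        if field == "*":
            result[key] = len(members)
        else:
            result[key] = len({m[field] for m in members if field in m})
    return result


# Hypothetical sample data for illustration only.
spans = [
    {"service": "svc1", "host": "host1", "resource": "/login"},
    {"service": "svc1", "host": "host1", "resource": "/login"},
    {"service": "svc1", "host": "host2", "resource": "/cart"},
]

print(aggregate_spans(spans, ["service", "host"]))
# {('svc1', 'host1'): 2, ('svc1', 'host2'): 1}
print(aggregate_spans(spans, ["service"], field="resource"))
# {('svc1',): 2}
```

Each key in the result is compared against the trigger conditions separately, which is why a high-cardinality dimension multiplies the number of potential alerts.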
Trigger Conditions¶
Configure trigger conditions and severity levels. When the query result contains multiple values, an event is generated if any value meets the trigger condition.
Supports configuring four levels of thresholds: Fatal, Severe, Important, Warning, and a Normal recovery condition.
| Level | Configuration | Description |
|---|---|---|
| Fatal | When Result >= [value] | Highest level alert, requires immediate action |
| Severe | When Result >= [value] | High-level alert, requires priority handling |
| Important | When Result >= [value] | Medium-level alert, requires attention |
| Warning | When Result >= [value] | Low-level alert, requires awareness |
| Normal | No events generated for [N] consecutive detections | If the detection metric triggers "Fatal", "Severe", "Important", or "Warning" anomaly events, and then N consecutive detections are normal, a "Normal" event is generated. Used to determine if an anomaly event has returned to normal, recommended for configuration |
For more details, refer to Event Level Description.
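As an illustration of how multi-level thresholds might be applied to a query result with several values, consider the sketch below. It assumes that the highest matching level wins when a value satisfies more than one threshold; the source does not state this explicitly. The names `classify` and `evaluate` and the threshold values are hypothetical.

```python
# Levels ordered from highest to lowest severity.
LEVELS = ["fatal", "severe", "important", "warning"]


def classify(value, thresholds):
    """Return the highest level whose threshold the value reaches, or None."""
    for level in LEVELS:
        limit = thresholds.get(level)
        # Assumption: every level uses the ">=" operator shown in the table.
        if limit is not None and value >= limit:
            return level
    return None


def evaluate(results, thresholds):
    """Map each detection object to its level, skipping objects that match none."""
    events = {}
    for obj, value in results.items():
        level = classify(value, thresholds)
        if level is not None:
            events[obj] = level
    return events


# Hypothetical thresholds and query results.
thresholds = {"fatal": 500, "severe": 200, "important": 100, "warning": 50}
results = {("svc1", "host1"): 620, ("svc1", "host2"): 130, ("svc2", "host1"): 12}

print(evaluate(results, thresholds))
# {('svc1', 'host1'): 'fatal', ('svc1', 'host2'): 'important'}
```

Levels left unconfigured are simply skipped, so a rule with only a Warning threshold still works with the same logic.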
Advanced Options¶
Consecutive Trigger Judgment¶
When enabled, events are generated only when the trigger condition is continuously met, avoiding false alarms from transient fluctuations (❗️Maximum configuration limit is 10 times).
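The consecutive judgment can be pictured as a streak counter that resets whenever a detection does not meet the condition. The class below is a minimal sketch, not the product's implementation: `ConsecutiveTrigger` is a hypothetical name, and it assumes that events keep being generated for as long as the streak continues.

```python
class ConsecutiveTrigger:
    """Report an event only after `required` consecutive matching detections."""

    def __init__(self, required: int):
        # The section above states a maximum of 10 consecutive times.
        if not 1 <= required <= 10:
            raise ValueError("required must be between 1 and 10")
        self.required = required
        self.streak = 0

    def update(self, condition_met: bool) -> bool:
        """Record one detection and report whether an event should be generated."""
        self.streak = self.streak + 1 if condition_met else 0
        return self.streak >= self.required


trigger = ConsecutiveTrigger(required=3)
detections = [True, True, False, True, True, True]
print([trigger.update(d) for d in detections])
# [False, False, False, False, False, True]
```

The single `False` in the third position resets the streak, so the transient spike in the first two detections never produces an event.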
Bulk Alert Protection¶
Enabled by default.
When the number of alerts generated in a single detection exceeds a preset threshold, the system automatically switches to a status summary strategy: instead of processing each alert object individually, it generates a small number of summary alerts based on event status and pushes them.
This ensures notification timeliness while significantly reducing alert noise and avoiding timeout risks from processing too many alerts.
When this switch is enabled, the details of events that the monitor subsequently generates do not display historical records or related events.
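The switch from per-object alerts to status summaries can be sketched as follows. This is an illustration only: the actual preset threshold is not given in this document, so `bulk_threshold` is a hypothetical parameter, and `build_notifications` and the message format are assumptions.

```python
from collections import Counter


def build_notifications(alerts, bulk_threshold):
    """Return one message per alert, or one summary per status above the threshold."""
    if len(alerts) <= bulk_threshold:
        return [f"[{a['status']}] {a['object']}" for a in alerts]
    # Too many alerts in one detection: summarize by event status instead.
    counts = Counter(a["status"] for a in alerts)
    return [f"[{status}] {n} object(s)" for status, n in counts.items()]


# Hypothetical alerts from a single detection.
alerts = [
    {"status": "fatal", "object": "svc1/host1"},
    {"status": "fatal", "object": "svc1/host2"},
    {"status": "warning", "object": "svc2/host1"},
]

print(build_notifications(alerts, bulk_threshold=5))
# ['[fatal] svc1/host1', '[fatal] svc1/host2', '[warning] svc2/host1']
print(build_notifications(alerts, bulk_threshold=2))
# ['[fatal] 2 object(s)', '[warning] 1 object(s)']
```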
Recovery Conditions¶
Configure recovery conditions and severity levels. When the query result contains multiple values, a recovery event is generated if any value meets the recovery condition.
Set independent recovery thresholds for different levels to achieve downgraded recovery. For example: a Severe alert recovers when the value drops below 70, while an Important alert recovers below 80.
Default Recovery Logic
When hierarchical recovery condition configuration is not enabled, recovery occurs automatically by default when the detection result no longer meets the trigger condition.
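The difference between the default logic and an independent recovery threshold can be shown for a single level. In the sketch below, the recovery threshold of 70 comes from the Severe example above, while the trigger threshold of 90, the function name `next_state`, and the sample values are assumptions made for illustration.

```python
def next_state(active, value, trigger_at, recover_below):
    """Return whether the alert is active after seeing one more value."""
    if not active:
        return value >= trigger_at
    # An active alert stays active until the value drops below the recovery threshold.
    return value >= recover_below


def run(values, trigger_at, recover_below):
    """Replay a series of detection results and collect the alert state after each."""
    active, states = False, []
    for value in values:
        active = next_state(active, value, trigger_at, recover_below)
        states.append(active)
    return states


values = [95, 85, 75, 65]

# Independent recovery threshold: the alert holds until the value falls below 70.
print(run(values, trigger_at=90, recover_below=70))  # [True, True, True, False]

# Default logic: recovery as soon as the trigger condition is no longer met.
print(run(values, trigger_at=90, recover_below=90))  # [True, False, False, False]
```

Keeping the recovery threshold below the trigger threshold prevents an alert from flapping when the value hovers around the trigger point.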
Data Gap¶
Processing strategy when the detection metric query result is empty within the detection interval:
| Option | Description |
|---|---|
| Do not trigger event (default) | No alarm is generated when data is missing, suitable for scenarios where data gaps are allowed |
| Treat query result as 0 | Treat empty data as a value of 0 for threshold judgment |
| Trigger data gap event | Treat missing data as an anomaly, triggering a data gap event |
| Trigger fatal event | Trigger a Fatal level event when data is missing |
| Trigger severe event | Trigger a Severe level event when data is missing |
| Trigger important event | Trigger an Important level event when data is missing |
| Trigger warning event | Trigger a Warning level event when data is missing |
| Trigger recovery event | Trigger a recovery event when data is missing |
When trigger conditions, data gap, and information generation are configured simultaneously, the triggering priority is judged as follows: Data Gap > Trigger Conditions > Information Event Generation.
That is: first judge if there is a data gap, then judge if thresholds are triggered, and finally judge if information events should be generated.
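The priority order reads naturally as a chain of checks. The following sketch simplifies each stage to a single outcome and is not the product's evaluation code: `decide`, its parameters, and the returned labels are hypothetical, and a single threshold stands in for the multi-level trigger conditions.

```python
def decide(values, gap_action, threshold, information_enabled):
    """Apply the data gap, threshold, and information checks in that order."""
    # 1. Data gap: an empty query result is handled first.
    if not values:
        return gap_action  # None models the "Do not trigger event" option
    # 2. Trigger conditions: any value reaching the threshold raises an alert.
    if any(v >= threshold for v in values):
        return "alert"
    # 3. Information events: only results that matched nothing above.
    if information_enabled:
        return "information"
    return None


print(decide([], "data_gap", 100, True))         # data_gap
print(decide([120, 10], "data_gap", 100, True))  # alert
print(decide([10], "data_gap", 100, True))       # information
print(decide([10], "data_gap", 100, False))      # None
```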
Information Generation¶
When this option is enabled, the system writes all detection results that do not match the above trigger conditions as "Information" events.
Suitable for scenarios requiring recording normal status changes or low-priority information.
Subsequent Configuration¶
After completing the above detection configuration, please continue to configure:
- Event Notification: Define event title, content, notification members, data gap handling, and associated incidents;
- Alert Configuration: Select alert strategies, set notification targets, and mute periods;
- Association: Associate dashboards for quick data viewing;
- Permissions: Set operation permissions to control who can edit/delete this monitor.