APM Metrics Anomaly Detection¶
Used to monitor key APM metric data within the workspace. The system counts the traces that meet the configured conditions within a specified time period and triggers an anomaly event when the result exceeds a custom threshold.
Detection Configuration¶
Detection Frequency¶
The execution frequency of the detection rule.
Detection Interval¶
The time range for querying metrics each time the task is executed. The selectable detection intervals vary depending on the detection frequency.
| Detection Frequency | Detection Interval (Dropdown Options) |
|---|---|
| 30s | 1m/5m/15m/30m/1h/3h |
| 1m | 1m/5m/15m/30m/1h/3h |
| 5m | 5m/15m/30m/1h/3h |
| 15m | 15m/30m/1h/3h/6h |
| 30m | 30m/1h/3h/6h |
| 1h | 1h/3h/6h/12h/24h |
| 6h | 6h/12h/24h |
| 12h | 12h/24h |
| 24h | 24h |
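As a quick illustration, the mapping in the table can be treated as a simple lookup. The following Python sketch (a hypothetical helper, not part of the product) checks whether a chosen detection interval is selectable for a given detection frequency.

```python
# Allowed detection intervals per detection frequency, taken from the table above.
ALLOWED_INTERVALS = {
    "30s": ["1m", "5m", "15m", "30m", "1h", "3h"],
    "1m":  ["1m", "5m", "15m", "30m", "1h", "3h"],
    "5m":  ["5m", "15m", "30m", "1h", "3h"],
    "15m": ["15m", "30m", "1h", "3h", "6h"],
    "30m": ["30m", "1h", "3h", "6h"],
    "1h":  ["1h", "3h", "6h", "12h", "24h"],
    "6h":  ["6h", "12h", "24h"],
    "12h": ["12h", "24h"],
    "24h": ["24h"],
}

def validate_interval(frequency: str, interval: str) -> bool:
    """Return True if the interval is selectable for the given frequency."""
    return interval in ALLOWED_INTERVALS.get(frequency, [])

print(validate_interval("5m", "15m"))  # True
print(validate_interval("5m", "1m"))   # False: 1m is not selectable at a 5m frequency
```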
Detection Metrics¶
Set the metrics to detect. You can configure the metric data of services within the workspace for a specified time range.
| Field | Description |
|---|---|
| Service | Allows monitoring of APM services within the current workspace. |
| Metric | Specific detection metrics, including request count, error request count, request error rate, average requests per second, average response time, P50 response time, P75 response time, P90 response time, P99 response time, etc. |
| Filter Conditions | Filter detection data based on metric tags to limit the detection scope. Supports adding one or multiple tag filters, including fuzzy match and fuzzy non-match conditions. |
| Detection Dimension | Any string-type (keyword) field in the configured data can be selected as a detection dimension; currently up to three fields can be selected. Combining multiple detection dimension fields identifies a specific detection object. The system determines whether the statistical metric of this detection object meets the threshold of the trigger condition; if so, an event is generated. For example, selecting detection dimensions host and host_ip means the detection object can be {host: host1, host_ip: 127.0.0.1}. |
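To illustrate how detection dimensions identify a detection object, here is a minimal Python sketch; the data points, field names, and helper function are hypothetical and only mirror the grouping behavior described above.

```python
from collections import defaultdict

def group_by_dimensions(points, dimensions):
    """Group metric points by up to three detection dimension fields."""
    groups = defaultdict(list)
    for p in points:
        key = tuple((d, p["tags"].get(d)) for d in dimensions[:3])  # max three fields
        groups[key].append(p["value"])
    return groups

# Illustrative data points with tag fields host and host_ip.
points = [
    {"tags": {"host": "host1", "host_ip": "127.0.0.1"}, "value": 820.0},
    {"tags": {"host": "host1", "host_ip": "127.0.0.1"}, "value": 910.0},
]

for obj, values in group_by_dimensions(points, ["host", "host_ip"]).items():
    # e.g. detection object {host: host1, host_ip: 127.0.0.1}
    print(dict(obj), "average:", sum(values) / len(values))
```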
The system counts the number of traces that meet the conditions within the specified time period and triggers an anomaly event when the count exceeds a custom threshold. This can be used to send notifications about service trace anomalies and errors.
| Field | Description |
|---|---|
| Source | The data source of the current detection metric. |
| Filter Conditions | Filter trace spans by tags to limit the scope of the detected data. Supports adding one or multiple tag filter conditions. |
| Aggregation Algorithm | Defaults to "*", which corresponds to the count aggregation function. If another field is selected, the aggregation function automatically changes to count distinct (counting the distinct values of that field in the data points). |
| Detection Dimension | Any string-type (keyword) field in the configured data can be selected as a detection dimension; currently up to three fields can be selected. Combining multiple detection dimension fields identifies a specific detection object. The system determines whether the statistical metric of this detection object meets the threshold of the trigger condition; if so, an event is generated. For example, selecting detection dimensions host and host_ip means the detection object can be {host: host1, host_ip: 127.0.0.1}. |
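The aggregation behavior described above can be sketched as follows; the span data and the `aggregate` helper are illustrative assumptions, not the product's implementation.

```python
def aggregate(spans, field="*"):
    """'*' -> count of matching spans; any other field -> count distinct of its values."""
    if field == "*":
        return len(spans)                                   # count
    return len({s[field] for s in spans if field in s})     # count distinct

# Illustrative spans that already passed the configured filter conditions.
spans = [
    {"service": "checkout", "status": "error", "resource": "/pay"},
    {"service": "checkout", "status": "error", "resource": "/pay"},
    {"service": "checkout", "status": "error", "resource": "/refund"},
]

print(aggregate(spans))              # 3 matching spans
print(aggregate(spans, "resource"))  # 2 distinct resource values
```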
Trigger Conditions¶
Set trigger conditions for the alert severity levels: you can configure a trigger condition for any of Critical, Important, Warning, or Normal.
Configure the trigger conditions and severity. When the query result contains multiple values, an event is generated if any value meets the trigger condition.
For more details, refer to Event Level Description.
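As a rough illustration of how trigger conditions are evaluated, the sketch below uses hypothetical thresholds and generates an event as soon as any value in the query result meets a condition.

```python
# Hypothetical severity thresholds, checked from highest severity down.
TRIGGERS = [
    ("critical",  lambda v: v >= 1000),
    ("important", lambda v: v >= 500),
    ("warning",   lambda v: v >= 200),
]

def evaluate(values):
    """Return the severity triggered by the query result, or None."""
    for level, condition in TRIGGERS:
        if any(condition(v) for v in values):   # any matching value triggers an event
            return level
    return None

print(evaluate([120, 640, 90]))  # "important"
```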
Consecutive Trigger Judgment¶
If Consecutive Trigger Judgment is enabled, it means an event is generated only after the trigger condition is met consecutively for a specified number of times. The maximum limit is 10 times.
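A minimal sketch of consecutive trigger judgment, assuming a hypothetical history of per-detection results:

```python
def should_fire(breach_history, required=3):
    """Generate an event only after the condition is met N times in a row (N <= 10)."""
    required = min(required, 10)                 # maximum limit is 10 times
    recent = breach_history[-required:]
    return len(recent) == required and all(recent)

history = [False, True, True, True]              # the last three detections breached
print(should_fire(history, required=3))          # True: event is generated
```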
Bulk Alert Protection¶
Enabled by default.
When the number of alerts generated in a single detection exceeds a preset threshold, the system automatically switches to a status summary strategy: instead of processing each alert object individually, it generates a small number of summary alerts based on event status and pushes them.
This ensures the timeliness of notifications while significantly reducing alert noise and avoiding timeout risks caused by processing too many alerts.
Note
When this switch is enabled, the subsequent Event Details generated by the monitor for such anomalies will not display historical records and associated events.
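The status-summary behavior can be sketched as follows; the alert structure and the threshold value are assumptions for illustration only.

```python
from collections import Counter

def protect(alerts, threshold=50):
    """Collapse alerts into per-status summaries when one run produces too many."""
    if len(alerts) <= threshold:
        return alerts                            # normal per-object alerting
    by_status = Counter(a["status"] for a in alerts)
    return [{"summary": True, "status": s, "count": n} for s, n in by_status.items()]

alerts = [{"status": "critical"}] * 40 + [{"status": "warning"}] * 30
print(protect(alerts, threshold=50))
# [{'summary': True, 'status': 'critical', 'count': 40},
#  {'summary': True, 'status': 'warning', 'count': 30}]
```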
Alert Level¶
- Alert level Critical (red), Important (orange), Warning (yellow);
- Alert level Normal (green): based on the configured number of detections, explained as follows:
    - Each execution of a detection task counts as 1 detection. For example, if Detection Frequency = 5 minutes, then 1 detection = 5 minutes.
    - The number of detections can be customized. For example, if Detection Frequency = 5 minutes, then 3 detections = 15 minutes.

| Level | Description |
|---|---|
| Normal | After the detection rule takes effect, if the data detection result returns to normal within the configured custom number of detections after a Critical, Important, or Warning anomaly event is generated, a recovery alert event is generated. ❗️ Recovery alert events are not subject to Alert Silence restrictions. If the number of detections for recovery alert events is not set, the alert event will not recover and will remain in the Events > Unrecovered Events list. |
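A minimal sketch of the recovery judgment, assuming a hypothetical list of per-detection results and a configured number of recovery detections:

```python
def is_recovered(detection_history, recovery_detections):
    """Recovery event after the result stays normal for the configured number of detections."""
    if recovery_detections is None:
        return False    # not set: the event stays in the Unrecovered Events list
    recent = detection_history[-recovery_detections:]
    return len(recent) == recovery_detections and all(r == "normal" for r in recent)

# Detection Frequency = 5 minutes, so 3 detections = 15 minutes of normal data.
history = ["critical", "normal", "normal", "normal"]
print(is_recovered(history, recovery_detections=3))  # True: recovery alert event
```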
Data Gap¶
Seven strategies can be configured for data gap status.
- Linked to the detection interval time range: the query result of the detection metric over the most recent minutes is evaluated, and no event is triggered.
- Linked to the detection interval time range: the query result of the detection metric over the most recent minutes is treated as 0. The result is then re-compared against the threshold configured in the Trigger Conditions above to determine whether an anomaly event is triggered.
- Custom: fill in the data gap time and choose to trigger a data gap event, a critical event, an important event, a warning event, or a recovery event. For this type of strategy, the recommended custom data gap time is >= the detection interval time span. If the configured time is <= the detection interval time span, a result may satisfy both the data gap and anomaly conditions at the same time; in that case, only the data gap processing result is applied.
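The three kinds of strategies can be sketched as follows; the strategy labels and threshold are illustrative assumptions, not product configuration keys.

```python
def handle_data_gap(strategy, threshold=100):
    """Illustrative handling of a data gap under the strategies described above."""
    if strategy == "do_not_trigger":
        return None                       # linked to detection interval, no event
    if strategy == "treat_as_zero":
        value = 0                         # re-compare 0 against the trigger threshold
        return "anomaly" if value > threshold else None
    if strategy == "custom":
        return "data_gap_event"           # or critical/important/warning/recovery
    raise ValueError("unknown strategy")

print(handle_data_gap("treat_as_zero"))   # None: 0 does not exceed the threshold
print(handle_data_gap("custom"))          # data_gap_event
```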
Information Generation¶
Enable this option to generate "Information" events for detection results that do not match any of the above trigger conditions and record them.
Note
When Trigger Conditions, Data Gap, and Information Generation are configured simultaneously, the triggering is judged according to the following priority: Data Gap > Trigger Conditions > Information Event Generation.
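A minimal sketch of this priority order, with hypothetical inputs:

```python
def classify(gap_detected, trigger_level):
    """Priority: Data Gap > Trigger Conditions > Information event generation."""
    if gap_detected:
        return "data_gap"            # highest priority
    if trigger_level is not None:
        return trigger_level         # critical / important / warning / normal
    return "information"             # no trigger condition matched

print(classify(gap_detected=False, trigger_level=None))  # "information"
```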
Other Configurations¶
For more details, refer to Rule Configuration.