APM Metrics Detection¶
Monitors key APM Metrics data within the workspace. The system counts the traces that meet the conditions within the specified time period and triggers an Incident when the count exceeds the custom threshold.
Detection Configuration¶
Detection Frequency¶
The execution frequency of the detection rule.
Detection Interval¶
The time range for querying Metrics each time the task is executed. The available detection intervals vary depending on the detection frequency.
| Detection Frequency | Detection Interval (Dropdown Options) |
|---|---|
| 30s | 1m/5m/15m/30m/1h/3h |
| 1m | 1m/5m/15m/30m/1h/3h |
| 5m | 5m/15m/30m/1h/3h |
| 15m | 15m/30m/1h/3h/6h |
| 30m | 30m/1h/3h/6h |
| 1h | 1h/3h/6h/12h/24h |
| 6h | 6h/12h/24h |
| 12h | 12h/24h |
| 24h | 24h |
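The frequency-to-interval mapping above can be expressed as a simple lookup. The sketch below is illustrative only (the function name and data structure are assumptions, not part of the product API):

```python
# The table above as a lookup: detection frequency -> allowed detection intervals.
VALID_INTERVALS = {
    "30s": ["1m", "5m", "15m", "30m", "1h", "3h"],
    "1m":  ["1m", "5m", "15m", "30m", "1h", "3h"],
    "5m":  ["5m", "15m", "30m", "1h", "3h"],
    "15m": ["15m", "30m", "1h", "3h", "6h"],
    "30m": ["30m", "1h", "3h", "6h"],
    "1h":  ["1h", "3h", "6h", "12h", "24h"],
    "6h":  ["6h", "12h", "24h"],
    "12h": ["12h", "24h"],
    "24h": ["24h"],
}

def is_valid_interval(frequency: str, interval: str) -> bool:
    """Return True if the interval is offered for the given detection frequency."""
    return interval in VALID_INTERVALS.get(frequency, [])
```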
Detection Metrics¶
Set the Metrics to detect. You can configure the Metrics data of services within the workspace for a specified time range.
| Field | Description |
|---|---|
| Service | Monitor the APM services within the current workspace. |
| Metrics | Specific detection Metrics, including request count, error request count, request error rate, average requests per second, average response time, P50 response time, P75 response time, P90 response time, P99 response time, etc. |
| Filter Conditions | Filter detection data based on the tags of Metrics to limit the detection scope. Supports adding one or more tag filters, and also supports fuzzy matching and fuzzy non-matching filter conditions. |
| Detection Dimensions | Any string (keyword) field in the configured data can be selected as a detection dimension; up to three fields are currently supported. Combining multiple detection dimension fields determines a specific detection object. The system checks whether the statistical Metrics of each detection object meet the threshold in the trigger conditions; if so, an Incident is generated. For example, with detection dimensions host and host_ip, a detection object could be {host: host1, host_ip: 127.0.0.1}. |
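Conceptually, the detection dimensions group the queried data into detection objects, and each object is evaluated against the trigger conditions independently. A minimal sketch (field names and row layout are hypothetical, not the product's internal representation):

```python
from collections import defaultdict

def group_by_dimensions(rows, dimensions):
    """Group metric rows into detection objects keyed by the dimension values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple((d, row.get(d)) for d in dimensions)
        groups[key].append(row)
    return groups

rows = [
    {"host": "host1", "host_ip": "127.0.0.1", "avg_response_time": 120},
    {"host": "host1", "host_ip": "127.0.0.1", "avg_response_time": 300},
    {"host": "host2", "host_ip": "10.0.0.2",  "avg_response_time": 80},
]
# With dimensions host and host_ip, one detection object is
# {host: host1, host_ip: 127.0.0.1}, holding two rows.
objects = group_by_dimensions(rows, ["host", "host_ip"])
```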
Counts the number of traces that meet the conditions within the specified time period and triggers an Incident when the count exceeds the custom threshold. This can be used to notify about abnormal errors in service traces.
| Field | Description |
|---|---|
| Source | The data source of the current detection Metrics. |
| Filter Conditions | Filter trace spans by tags to limit the scope of the detection data. Supports adding one or more tag filter conditions. |
| Aggregation Algorithm | Defaults to “*”, which corresponds to the aggregation function count. If another field is selected, the aggregation function automatically changes to count distinct (the number of distinct values of that field). |
| Detection Dimensions | Any string (keyword) field in the configured data can be selected as a detection dimension; up to three fields are currently supported. Combining multiple detection dimension fields determines a specific detection object. The system checks whether the statistical Metrics of each detection object meet the threshold in the trigger conditions; if so, an Incident is generated. For example, with detection dimensions host and host_ip, a detection object could be {host: host1, host_ip: 127.0.0.1}. |
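The difference between the two aggregation modes in the table can be sketched as follows (the span dictionaries are illustrative test data, not a real span schema):

```python
def aggregate(spans, field="*"):
    """Count matching spans, or count distinct values of a chosen field."""
    if field == "*":
        return len(spans)                          # count
    values = {s[field] for s in spans if field in s}
    return len(values)                             # count distinct

spans = [
    {"service": "cart", "status": "error"},
    {"service": "cart", "status": "error"},
    {"service": "pay",  "status": "error"},
]
# aggregate(spans) counts all three spans; aggregate(spans, "service")
# counts the two distinct services, cart and pay.
```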
Trigger Conditions¶
Set the trigger conditions for each alert level: you can configure any one of emergency, important, warning, or normal.
Configure the trigger conditions and severity. When the query result has multiple values, an Incident is generated if any value meets the trigger conditions.
For more details, refer to Incident Level Description.
Continuous Trigger Judgment¶
If continuous trigger judgment is enabled, another Incident is generated only after the trigger conditions are met the configured number of times in a row. The maximum is 10 consecutive detections.
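The consecutive-detection rule above can be sketched like this (a simplified model, assuming one boolean result per detection run):

```python
def should_trigger(history, required_consecutive):
    """history: per-detection results, most recent last; True = condition met.

    Returns True only when the condition held for the last
    `required_consecutive` detections in a row (limit: 10).
    """
    if required_consecutive > 10:
        raise ValueError("continuous trigger limit is 10")
    if len(history) < required_consecutive:
        return False
    return all(history[-required_consecutive:])
```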
Bulk Alert Protection¶
Enabled by default.
When the number of alerts generated in a single detection exceeds the preset threshold, the system automatically switches to a status-summary strategy: instead of processing each alert object individually, it generates and pushes a small number of summary alerts based on Incident status.
This ensures the timeliness of notifications while significantly reducing alert noise, avoiding the risk of timeout due to processing too many alerts.
Note
When this switch is enabled, the Incident Details generated by subsequent monitor detections will not display historical records and related Incidents.
Alert Level¶
- Alert levels Emergency (red), Important (orange), Warning (yellow): judged based on the configured condition operators.
- Alert level Normal (green): judged based on the configured number of detections, explained as follows:
    - Each execution of a detection task counts as 1 detection. For example, if Detection Frequency = 5 minutes, then 1 detection = 5 minutes.
    - The number of detections can be customized. For example, if Detection Frequency = 5 minutes, then 3 detections = 15 minutes.
| Level | Description |
|---|---|
| Normal | After the detection rule takes effect, if the detection results return to normal within the configured number of detections after an emergency, important, or warning Incident is generated, a recovery alert Incident is generated. ❗️ Recovery alert Incidents are not subject to Alert Silence restrictions. If the number of detections for recovery alerts is not set, the alert Incident will not recover and will remain in the Incidents > Unrecovered Incidents list. |
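The recovery rule above can be illustrated with a small sketch (function names and the boolean-history model are assumptions for illustration):

```python
def recovery_window_minutes(frequency_minutes, detections):
    """E.g. Detection Frequency = 5 minutes and 3 detections -> 15 minutes."""
    return frequency_minutes * detections

def should_recover(recent_results, detections):
    """recent_results: per-detection results, most recent last; True = normal.

    A recovery alert Incident is generated once the results have been normal
    for the configured number of consecutive detections.
    """
    if len(recent_results) < detections:
        return False
    return all(recent_results[-detections:])
```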
Data Gap¶
Seven strategies can be configured for the data gap status.
- Link the detection interval time range, judge the query result of the detection Metrics over the most recent minutes, and do not trigger an Incident.
- Link the detection interval time range, judge the query result of the detection Metrics over the most recent minutes, and treat the query result as 0; the query result is then re-compared with the threshold configured in the Trigger Conditions above to determine whether to trigger an Incident.
- Custom-fill the detection interval value and trigger a data gap Incident, emergency Incident, important Incident, warning Incident, or recovery Incident. For this type of strategy, it is recommended that the custom data gap time be >= the detection interval. If the configured time is <= the detection interval, the data gap and abnormal conditions may both be met at the same time, in which case only the data gap result is applied.
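The routing between these strategies can be sketched as follows. This is a simplified model (the strategy names are paraphrased from the list above, and `None` stands in for a data-gap query result):

```python
def handle_result(result, strategy, threshold):
    """Route a query result through a data-gap strategy, then the threshold."""
    if result is None:                          # data gap detected
        if strategy == "ignore":
            return "no incident"                # strategy 1: do not trigger
        if strategy == "treat_as_zero":
            result = 0                          # strategy 2: compare 0 to threshold
        else:
            return "data-gap incident"          # custom-fill trigger strategies
    return "incident" if result > threshold else "no incident"
```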
Information Generation¶
Enable this option to generate "Information" Incidents for detection results that do not match the above trigger conditions.
Note
If trigger conditions, data gap, and information generation are configured simultaneously, the triggering is judged in the following priority: data gap > trigger conditions > information Incident generation.
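That evaluation priority can be summarized in a short sketch (a simplified model; `None` again stands for a data-gap result, and a single `>` threshold stands for the full set of trigger conditions):

```python
def evaluate(result, threshold, info_enabled=True):
    """Apply the stated priority: data gap > trigger conditions > information."""
    if result is None:
        return "data gap"
    if result > threshold:
        return "trigger condition"
    return "information" if info_enabled else "none"
```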
Other Configurations¶
For more details, refer to Rule Configuration.