Monitor Internal Principles¶
Because of network, system, and other constraints, monitor detection involves a few special behaviors during execution, described below.
1. Detection Trigger Time¶
The detection frequency configured by the user is internally converted into a Crontab expression. The monitor triggers detections according to this Crontab expression, rather than executing every N minutes after it is created or saved.
Assume a user sets the execution frequency of "Monitor A" to "5 minutes". The corresponding Crontab expression is `*/5 * * * *`, meaning the actual trigger times are as shown in the table below:
Action | Time |
---|---|
User creates/saves monitor | 00:00:30 |
Monitor triggers detection | 00:05:00 |
Monitor triggers detection | 00:10:00 |
Monitor triggers detection | 00:15:00 |
Monitor triggers detection | 00:20:00 |
... | ... |
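The following is a minimal sketch of this alignment behavior (Python standard library only, with a hypothetical `next_trigger` helper). The real scheduler presumably evaluates the full Crontab expression, but for a simple `*/5 * * * *` schedule the effect is the same: triggers land on wall-clock multiples of 5 minutes, regardless of when the monitor was saved.

```python
from datetime import datetime, timedelta

def next_trigger(after: datetime, every_minutes: int = 5) -> datetime:
    """Next wall-clock time on a */N-minute schedule, strictly after `after`."""
    aligned = after.replace(second=0, microsecond=0)
    offset = (-aligned.minute) % every_minutes  # minutes up to the next multiple of N
    if offset == 0:
        offset = every_minutes                  # already on a boundary -> take the next one
    return aligned + timedelta(minutes=offset)

t = datetime(2024, 1, 1, 0, 0, 30)              # monitor created/saved at 00:00:30
for _ in range(3):
    t = next_trigger(t)
    print(t.time())                             # 00:05:00, 00:10:00, 00:15:00
```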
2. Detection Range Calibration¶
Since the platform hosts thousands of detectors from all of its users, detections triggered at the same time point cannot all execute simultaneously; most tasks enter a queue and wait.
As a result, most detections encounter the situation where a task that should have executed at time T actually executes several seconds later (T + a few seconds).
If the actual execution time were used directly as the end time of the query range, the detection ranges would inevitably overlap or leave data uncovered, for example:
Assuming the detection range is 5 minutes:
Action | Time | Detection Range |
---|---|---|
1. Actual execution | 00:05:10 | 00:00:10 ~ 00:05:10 |
2. Actual execution | 00:10:05 | 00:05:05 ~ 00:10:05 |
3. Actual execution | 00:15:30 | 00:10:30 ~ 00:15:30 |
As seen above, between actions 1 and 2 the detection ranges overlap from 00:05:05 to 00:05:10.
Between actions 2 and 3, data from 00:10:05 to 00:10:30 was not covered by any detection.
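To make the overlap and gap concrete, here is a small illustrative Python snippet (not Guance's actual implementation) that anchors each 5-minute query range to the actual execution times from the table above and measures the mismatch between consecutive ranges:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # configured detection range

# Naive approach: anchor the query range to the actual execution time.
executions = [
    datetime(2024, 1, 1, 0, 5, 10),
    datetime(2024, 1, 1, 0, 10, 5),
    datetime(2024, 1, 1, 0, 15, 30),
]
ranges = [(t - WINDOW, t) for t in executions]

for (_prev_start, prev_end), (start, _end) in zip(ranges, ranges[1:]):
    if start < prev_end:
        print(f"overlap of {(prev_end - start).seconds} s")  # 00:05:05 ~ 00:05:10
    elif start > prev_end:
        print(f"gap of {(start - prev_end).seconds} s")      # 00:10:05 ~ 00:10:30
```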
Current Solution¶
To avoid fluctuations in the detection range caused by queuing delays, the detection range of a monitor (i.e., the DQL query time range) is calibrated against its trigger time, for example:
Assuming the detection range is 5 minutes:
Action | Time | Detection Range |
---|---|---|
Monitor triggers detection (queued) | 00:05:00 | |
Monitor actual execution (dequeued) | 00:05:10 | 00:00:00 ~ 00:05:00 |
Monitor triggers detection (queued) | 00:10:00 | |
Monitor actual execution (dequeued) | 00:10:30 | 00:05:00 ~ 00:10:00 |
It can be seen that no matter how long a detection waits in the queue, its detection range is determined by its trigger time and does not fluctuate with the actual execution time.
Note: The table above only illustrates the concept of "detection range calibration." The actual detection range may also be affected by "detection range drift."
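A minimal sketch of the calibration idea (hypothetical helper name; drift is ignored here and covered in the next section): the query window is derived from the scheduled trigger time, so queue latency never changes it.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)                     # configured detection range

def calibrated_range(trigger_time: datetime) -> tuple:
    """Query window anchored to the scheduled trigger time, not the dequeue time."""
    return trigger_time - WINDOW, trigger_time

trigger  = datetime(2024, 1, 1, 0, 5, 0)          # scheduled trigger at 00:05:00
dequeued = datetime(2024, 1, 1, 0, 5, 10)         # actually starts running at 00:05:10
start, end = calibrated_range(trigger)            # `dequeued` plays no part in the window
print(start.time(), "~", end.time())              # 00:00:00 ~ 00:05:00
```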
3. Detection Range Drift¶
Affected by network latency and storage write delay, reported data generally arrives several seconds, sometimes tens of seconds, late. Concretely, newly reported data cannot be queried via DQL immediately.
As a result, data can easily be missing from a detection range during processing, for example:
Assuming the detection range is 5 minutes:
Action | Time | Detection Range |
---|---|---|
Data A reported (not yet persisted) | 00:09:59 | |
Monitor triggers detection | 00:10:00 | 00:05:00 ~ 00:10:00 |
Data A persisted | 00:10:05 (timestamp: 00:09:59) | |
Monitor triggers detection | 00:15:00 | 00:10:00 ~ 00:15:00 |
It can be seen that although Data A was reported before the detection executed, it had not yet been persisted and therefore could not be queried by the monitor.
Moreover, even after Data A is persisted, its timestamp (00:09:59) falls before the next detection range (00:10:00 ~ 00:15:00), so the next round of detection still cannot find it.
Current Solution¶
To solve the issue above, every monitor automatically shifts its detection range one minute earlier when executing a detection, to avoid querying data that is still being written to storage.
In this case, the previous example becomes:
Assuming the detection range is 5 minutes:
Action | Time | Detection Range |
---|---|---|
Data A reported (not yet persisted) | 00:09:59 | |
Monitor triggers detection | 00:10:00 | 00:04:00 ~ 00:09:00 (shifted 1 minute) |
Data A persisted | 00:10:05 (timestamp: 00:09:59) | |
Monitor triggers detection | 00:15:00 | 00:09:00 ~ 00:14:00 (shifted 1 minute) |
It can be seen that although Data A was persisted late, it can still be detected in the second round, which avoids losing detection data because of persistence delay.
Note: If the persistence delay exceeds 1 minute, this mitigation fails and the detection will not produce the expected result.
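Putting calibration and drift together, a sketch of the effective query window might look like the following (the 1-minute shift mirrors the example above; the real value is an internal platform setting, and the helper name is hypothetical):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # configured detection range
DRIFT  = timedelta(minutes=1)   # backward shift to dodge persistence delay

def drifted_range(trigger_time: datetime) -> tuple:
    """Effective DQL window: [trigger - WINDOW - DRIFT, trigger - DRIFT]."""
    end = trigger_time - DRIFT
    return end - WINDOW, end

start, end = drifted_range(datetime(2024, 1, 1, 0, 15, 0))
print(start.time(), "~", end.time())   # 00:09:00 ~ 00:14:00, so Data A (00:09:59) is caught
```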
4. Logic for Detecting Data Gap¶
Since Guance is a platform built on time-series data, it is not like asset-management software: there is no master table listing all assets. In practice, we can only confirm what data exists through queries; we cannot know what does not exist.
Imagine I have a box containing pencils and erasers.
I can clearly state that "there are pencils and erasers in the box,"
but I cannot tell you what is NOT in the box,
because I do not know what "should originally be in the box" (i.e., the asset master table).
Therefore, in monitors, "data gap detection" can only judge data gaps and recoveries using an "edge-triggered" approach.
That is, "if in the previous query I found X, but in this round I cannot find X, then X has experienced a data gap."
For example:
Assuming the detection range is 5 minutes:
00:00:00 ~ 00:05:00 | 00:05:00 ~ 00:10:00 | Judgment Result |
---|---|---|
Has data | No data | Data gap |
Has data | Has data | Continuously normal |
No data | Has data | Data re-uploaded |
No data | No data | Continuous data gap (meaningless judgment) |
Note: The table above is only an illustration describing the "logic for detecting data gaps." The actual detection ranges will also be influenced by configurations such as "detection range drift," "detection range," and "data gap occurred continuously for N minutes."
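As an illustration of this edge-triggered judgment (a simplified sketch, not the platform's actual code), suppose each detection round returns the set of series that had data in its window; the judgment then only compares two consecutive rounds:

```python
def judge(previous: set, current: set) -> dict:
    """Edge-triggered comparison of two consecutive detection rounds."""
    results = {}
    for key in sorted(previous | current):
        if key not in current:
            results[key] = "data gap"             # had data before, none now
        elif key not in previous:
            results[key] = "data re-uploaded"     # no data before, data now
        else:
            results[key] = "continuously normal"  # data in both rounds
    # a series absent from both rounds never appears here (the "meaningless judgment" row)
    return results

print(judge(previous={"host-a", "host-b"}, current={"host-b", "host-c"}))
# {'host-a': 'data gap', 'host-b': 'continuously normal', 'host-c': 'data re-uploaded'}
```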
5. Data Gap / Data Gap Recovery Events¶
When a monitor detects "data gap" or "data re-uploaded," depending on different configurations set by the user, it might generate either a "Data Gap Event" or a "Data Gap Recovery Event."
However, to prevent meaningless repeated alerts, before generating these events, existing events will be checked to determine whether a corresponding event needs to be generated:
Previous Event | Current Detection Result | Outcome |
---|---|---|
No event/Data Gap Recovery Event | Data Gap | Generate Data Gap Event |
No event/Data Gap Event | Data Re-uploaded | Generate Data Gap Recovery Event |
Thus, "Data Gap Events" and "Data Gap Recovery Events" always appear in pairs; there won't be consecutive "Data Gap Events" or consecutive "Data Gap Recovery Events."
X. Frequently Asked Questions¶
The time marked in the event doesn't match the event generation time¶
Event details and alert notifications both indicate a time, e.g., 00:15:00.
This time is always the monitor's "detection trigger time", i.e., the time expressed by the Crontab expression, and it is always a whole multiple of the detection interval.
Exception: If you manually click Execute on the monitor list, the corresponding generated event time will be the actual execution time.
The time marked in the event doesn't match the actual fault data point time¶
The time marked in events is always the monitor's "detection trigger time," and because of "detection range drift," the actual fault data point's time may indeed fall outside the interval "detection trigger time - detection range ~ detection trigger time."
This is because the actual detection range is "detection trigger time - detection range - drift ~ detection trigger time - drift." For example, with a 5-minute detection range and a 1-minute drift, a detection triggered at 00:15:00 queries 00:09:00 ~ 00:14:00, so a fault data point at 00:09:30 lies before 00:10:00 yet is still legitimately detected.
This situation is considered normal.
Directly viewing data in Guance indicates a fault should exist, but the monitor did not detect it¶
The causes of this problem include:
- The data-reporting write delay was too long, so the fault data point could not be queried at the time the detection executed.
- The detection failed to execute because the DQL query failed.
These situations are beyond what the monitor can control.