
Monitor Internal Principles


Due to network, system, and other limitations, there are some special treatments in the execution of monitor detections.

1. Detection Trigger Time

The detection frequency configured by users is actually internally converted into a Crontab expression. The monitor will start detection according to this Crontab expression rather than executing every N minutes after creation or saving.

Assume a user sets the execution frequency of "Monitor A" to "5 minutes". The corresponding Crontab expression is */5 * * * *, so the actual trigger times are as shown in the table below:

| Action | Time |
| --- | --- |
| User creates/saves monitor | 00:00:30 |
| Monitor triggers detection | 00:05:00 |
| Monitor triggers detection | 00:10:00 |
| Monitor triggers detection | 00:15:00 |
| Monitor triggers detection | 00:20:00 |
| ... | ... |
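The alignment above can be expressed as a minimal Python sketch. The function name `next_triggers` is hypothetical; this simply rounds the creation time up to the next multiple of N minutes, the way a */N crontab fires, rather than counting N minutes from the save time:

```python
from datetime import datetime, timedelta

def next_triggers(created_at: datetime, every_min: int, count: int):
    """Return the first `count` trigger times after `created_at`,
    aligned to multiples of `every_min` minutes (like a */N crontab)."""
    # Drop seconds, then round up to the next every_min boundary.
    base = created_at.replace(second=0, microsecond=0)
    first = base + timedelta(minutes=every_min - base.minute % every_min)
    return [first + timedelta(minutes=every_min * i) for i in range(count)]

# A monitor saved at 00:00:30 first fires at 00:05:00, then every 5 minutes.
created = datetime(2024, 1, 1, 0, 0, 30)
print([t.strftime("%H:%M:%S") for t in next_triggers(created, 5, 3)])
```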

2. Detection Range Calibration

Since the platform hosts thousands of detectors from all users, detections triggered at the same time point cannot all execute simultaneously; most tasks enter a queue and wait.

Therefore, most detections encounter the situation where a task that should have executed at time T actually executes several seconds later (T + a few seconds).

If we directly used the actual execution time as the end time of the query range, detection ranges would inevitably overlap or leave data uncovered, for example:

Assuming the detection range is 5 minutes:

| Action | Time | Detection Range |
| --- | --- | --- |
| 1. Actual execution | 00:05:10 | 00:00:10 ~ 00:05:10 |
| 2. Actual execution | 00:10:05 | 00:05:05 ~ 00:10:05 |
| 3. Actual execution | 00:15:30 | 00:10:30 ~ 00:15:30 |

As seen above, between actions 1 and 2, the detection ranges overlap from 00:05:05 to 00:05:10.

Between actions 2 and 3, data from 00:10:05 to 00:10:30 is not covered by any detection.

Current Solution

To avoid fluctuations in the detection range caused by queuing delays, the detection range of monitors (i.e., DQL query data range) will be calibrated based on their trigger times, for example:

Assuming the detection range is 5 minutes:

| Action | Time | Detection Range |
| --- | --- | --- |
| Monitor triggers detection (queued) | 00:05:00 | |
| Monitor actual execution (dequeued) | 00:05:10 | 00:00:00 ~ 00:05:00 |
| Monitor triggers detection (queued) | 00:10:00 | |
| Monitor actual execution (dequeued) | 00:10:30 | 00:05:00 ~ 00:10:00 |

It can be seen that no matter how long a detection waits in the queue, its detection range is always determined by the trigger time and does not fluctuate with the actual execution time.

Note: The table above only illustrates the concept of "detection range calibration." The actual detection range may also be affected by "detection range drift."
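Calibration can be sketched in a few lines of Python. The function name `calibrated_range` is hypothetical; the idea is to recover the scheduled trigger time by flooring the actual (dequeued) execution time to the crontab boundary, then anchor the query window to that trigger:

```python
from datetime import datetime, timedelta

def calibrated_range(actual_exec: datetime, every_min: int, range_min: int):
    """Floor the actual execution time back to the scheduled trigger,
    then anchor the query window to the trigger, so queue delay never
    shifts the detection range."""
    trigger = actual_exec.replace(second=0, microsecond=0)
    trigger -= timedelta(minutes=trigger.minute % every_min)
    return trigger - timedelta(minutes=range_min), trigger

# Dequeued at 00:10:30 after the 00:10:00 trigger: window is 00:05:00 ~ 00:10:00.
start, end = calibrated_range(datetime(2024, 1, 1, 0, 10, 30), 5, 5)
print(start.strftime("%H:%M:%S"), "~", end.strftime("%H:%M:%S"))
```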

3. Detection Range Drift

Affected by network latency and storage write delays, data reporting generally lags by several seconds, sometimes tens of seconds. Concretely, newly reported data cannot be queried via DQL immediately.

Therefore, each detection range can easily miss data during detection processing, for example:

Assuming the detection range is 5 minutes:

| Action | Time | Detection Range |
| --- | --- | --- |
| Data A reported (not yet persisted) | 00:09:59 | |
| Monitor triggers detection | 00:10:00 | 00:05:00 ~ 00:10:00 |
| Data A persisted | 00:10:05 (timestamp: 00:09:59) | |
| Monitor triggers detection | 00:15:00 | 00:10:00 ~ 00:15:00 |

It can be seen that although Data A was reported before the detection executed, it could not be queried by the monitor because it had not yet been persisted.

Furthermore, even after Data A is persisted, its timestamp falls before the next detection range, so the next round of detection still cannot see it.

Current Solution

To solve the issue above, all monitors automatically shift their detection range one minute earlier when executing detection, avoiding queries over data that is still being written to storage.

In this case, the previous example becomes:

Assuming the detection range is 5 minutes:

| Action | Time | Detection Range |
| --- | --- | --- |
| Data A reported (not yet persisted) | 00:09:59 | |
| Monitor triggers detection | 00:10:00 | 00:04:00 ~ 00:09:00 (shifted 1 minute) |
| Data A persisted | 00:10:05 (timestamp: 00:09:59) | |
| Monitor triggers detection | 00:15:00 | 00:09:00 ~ 00:14:00 (shifted 1 minute) |

It can be seen that although Data A was persisted late, it can still be detected in the second round of detection, avoiding detection data loss caused by persistence delay.

Note: If the persistence delay exceeds 1 minute, this solution will fail, and detection will not produce the expected effect.
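Combining drift with the range gives the window described in the FAQ below: [trigger - range - drift, trigger - drift]. A minimal sketch (the name `drifted_range` is hypothetical):

```python
from datetime import datetime, timedelta

DRIFT_MIN = 1  # every monitor shifts its window 1 minute earlier

def drifted_range(trigger_time: datetime, range_min: int, drift_min: int = DRIFT_MIN):
    """Detection window with drift applied:
    [trigger - range - drift, trigger - drift]."""
    end = trigger_time - timedelta(minutes=drift_min)
    return end - timedelta(minutes=range_min), end

# A 00:10:00 trigger with a 5-minute range queries 00:04:00 ~ 00:09:00,
# so data stamped 00:09:59 is picked up by the next round (00:09:00 ~ 00:14:00).
start, end = drifted_range(datetime(2024, 1, 1, 0, 10, 0), 5)
print(start.strftime("%H:%M:%S"), "~", end.strftime("%H:%M:%S"))
```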

4. Logic for Detecting Data Gap

Since Guance is a platform based on time-series data, unlike asset-management software it has no master table of assets. In reality, we can only confirm what data exists through queries; we cannot know what does not exist.

Imagine I have a box containing pencils and erasers. I can clearly state that "there are pencils and erasers in the box," but I cannot tell you what is not in the box, because I don't know what "should originally be in the box" (i.e., the asset master table).

Therefore, in monitors, "data gap detection" can only judge data gaps and recoveries using an "edge-triggered" approach.

That is, "if in the previous query I found X, but in this round I cannot find X, then X has experienced a data gap."

For example:

Assuming the detection range is 5 minutes:

| 00:00:00 ~ 00:05:00 | 00:05:00 ~ 00:10:00 | Judgment Result |
| --- | --- | --- |
| Has data | No data | Data gap |
| Has data | Has data | Continuously normal |
| No data | Has data | Data re-uploaded |
| No data | No data | Continuous data gap (meaningless judgment) |

Note: The table above is only an illustration describing the "logic for detecting data gaps." The actual detection ranges will also be influenced by configurations such as "detection range drift," "detection range," and "data gap occurred continuously for N minutes."
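The edge-triggered comparison can be sketched as a set difference between the series seen in two consecutive windows. The function name `diff_series` is hypothetical:

```python
def diff_series(prev_seen: set, curr_seen: set):
    """Edge-triggered judgment: a series counts as gapped only when it
    appeared in the previous window but not the current one, and as
    re-uploaded on the reverse transition. A series absent from both
    windows cannot be judged at all (there is no asset master table)."""
    gapped = prev_seen - curr_seen       # present before, missing now
    recovered = curr_seen - prev_seen    # missing before, present now
    return gapped, recovered

# host-a disappears, host-c reappears; host-b is continuously normal.
gapped, recovered = diff_series({"host-a", "host-b"}, {"host-b", "host-c"})
print(gapped, recovered)
```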

5. Data Gap / Data Gap Recovery Events

When a monitor detects "data gap" or "data re-uploaded," depending on different configurations set by the user, it might generate either a "Data Gap Event" or a "Data Gap Recovery Event."

However, to prevent meaningless repeated alerts, existing events are checked before generating a new one, to determine whether a corresponding event actually needs to be generated:

| Previous Event | Current Detection Result | Outcome |
| --- | --- | --- |
| No event / Data Gap Recovery Event | Data gap | Generate Data Gap Event |
| No event / Data Gap Event | Data re-uploaded | Generate Data Gap Recovery Event |

Thus, "Data Gap Events" and "Data Gap Recovery Events" always appear in pairs; there won't be consecutive "Data Gap Events" or consecutive "Data Gap Recovery Events."
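The de-duplication rule amounts to a two-state toggle. A minimal sketch (the name `next_event` and the string labels are hypothetical):

```python
def next_event(previous_event, detection_result):
    """Suppress repeats: a Data Gap Event may only follow no event or a
    recovery event, and a recovery event may only follow no event or a
    gap event, so the two event types always alternate in pairs."""
    if detection_result == "data gap" and previous_event in (None, "recovery"):
        return "gap"
    if detection_result == "data re-uploaded" and previous_event in (None, "gap"):
        return "recovery"
    return None  # duplicate result: no new event generated

# Two consecutive gap results only produce one Data Gap Event.
print(next_event(None, "data gap"), next_event("gap", "data gap"))
```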

X. Frequently Asked Questions

The time marked in the event doesn't match the event generation time

Event details and alert notifications will both indicate a time, e.g., 00:15:00.

This indicated time is always the monitor's "detection trigger time"—that is, the time expressed by the Crontab expression—and must be a multiple of the regular detection interval.

Exception: If you manually click Execute on the monitor list, the corresponding generated event time will be the actual execution time.

The time marked in the event doesn't match the actual fault data point time

The time marked in events is always the monitor's "detection trigger time." Because of "detection range drift," the actual detection range is "detection trigger time - detection range - drift time ~ detection trigger time - drift time," so the actual fault data point's time may indeed fall outside the interval "detection trigger time - detection range ~ detection trigger time."

This situation is considered normal.

Directly viewing data in Guance indicates a fault should exist, but the monitor did not detect it

The causes of this problem are as follows:

  1. Due to excessive write delay in data reporting, the fault data point could not be queried when the detection executed.
  2. The detection execution failed because the DQL query failed.

These situations are beyond the control scope of the monitor.
