How Monitors Work Internally


Because of factors such as network conditions and system load, the execution of monitor detections involves several special internal processing mechanisms, which are explained below.

Detection Trigger Time

The detection frequency configured by the user is internally converted into a Crontab expression. Monitors fire strictly according to this schedule, rather than simply executing every N minutes after being created or saved.

For example, if a user configures an execution frequency of "5 minutes" for "Monitor A", the corresponding Crontab expression is `*/5 * * * *`. The specific trigger time relationship is as follows:

| Action | Time |
| --- | --- |
| User creates/saves monitor | 00:00:30 |
| Monitor triggers detection | 00:05:00 |
| Monitor triggers detection | 00:10:00 |
| ... | ... |
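
The clock-aligned trigger times above can be sketched with a few lines of Python. This is purely illustrative and is not the platform's actual scheduler; the `next_trigger` helper and the example timestamps are assumptions made for the demonstration.

```python
from datetime import datetime, timedelta

def next_trigger(after: datetime, every_minutes: int = 5) -> datetime:
    """Next scheduled run of a "*/N * * * *"-style crontab strictly after
    `after`: the next wall-clock time whose minute is a multiple of N."""
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while candidate.minute % every_minutes != 0:
        candidate += timedelta(minutes=1)
    return candidate

# User saves "Monitor A" at 00:00:30; subsequent runs align to the clock.
t = datetime(2024, 1, 1, 0, 0, 30)
for _ in range(3):
    t = next_trigger(t)
    print(t.time())    # 00:05:00, then 00:10:00, then 00:15:00
```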

Detection Range Calibration

Since the platform needs to process thousands of monitors configured by all users, detection tasks triggered at the same time cannot all be executed simultaneously; most tasks enter a queue and wait.

As a result, most detection tasks are scheduled to trigger at time T but actually execute at time T + Δt.

If the actual execution time were used directly as the end time of the query, the detection time ranges would overlap or leave gaps. For example:

Assume the detection time range is 5 minutes:

| Action | Time | Actual queried data range |
| --- | --- | --- |
| 1. Actual execution | 00:05:10 | 00:00:10 ~ 00:05:10 |
| 2. Actual execution | 00:10:05 | 00:05:05 ~ 00:10:05 |
| 3. Actual execution | 00:15:30 | 00:10:30 ~ 00:15:30 |

In this case:

- The detection ranges of "Action 1" and "Action 2" overlap between 00:05:05 ~ 00:05:10;
- The detection ranges of "Action 2" and "Action 3" have a gap between 00:10:05 ~ 00:10:30, causing data in this period to be uncovered.

Current Solution

To avoid fluctuations in the detection range caused by task queuing, the data query range of monitors is calibrated based on their scheduled trigger time, not the actual execution time.

Assume the detection time range is 5 minutes:

| Action | Time | Final queried data range |
| --- | --- | --- |
| Monitor triggers detection (enqueued) | 00:05:00 | |
| Monitor actually executes (dequeued) | 00:05:10 | 00:00:00 ~ 00:05:00 |
| Monitor triggers detection (enqueued) | 00:10:00 | |
| Monitor actually executes (dequeued) | 00:10:30 | 00:05:00 ~ 00:10:00 |

As can be seen, regardless of how long the detection task queues, its data query range is always based on the scheduled trigger time, ensuring the continuity and stability of the time window.
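
A minimal sketch of this calibration, assuming a 5-minute detection range; the `query_range` helper and the example timestamps are illustrative only and do not reflect the platform's internal implementation.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # configured detection time range

def query_range(scheduled: datetime) -> tuple[datetime, datetime]:
    """Anchor the query window to the scheduled trigger time, so queue
    delay (actual execution at scheduled + delta) never shifts the window."""
    return scheduled - WINDOW, scheduled

# Two runs that waited in the queue for 10s and 30s respectively:
runs = [
    (datetime(2024, 1, 1, 0, 5),  datetime(2024, 1, 1, 0, 5, 10)),
    (datetime(2024, 1, 1, 0, 10), datetime(2024, 1, 1, 0, 10, 30)),
]
for scheduled, actual in runs:
    start, end = query_range(scheduled)
    print(f"executed {actual.time()}, queried {start.time()} ~ {end.time()}")
# executed 00:05:10, queried 00:00:00 ~ 00:05:00
# executed 00:10:30, queried 00:05:00 ~ 00:10:00
```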

Note

The above examples are only to illustrate the principle of "Detection Range Calibration". The actual range is also affected by the "Detection Range Drift" mechanism.

Detection Range Drift

Due to factors like network latency and data processing, it usually takes several seconds to tens of seconds after data is reported before it is written to disk (i.e., becomes queryable via DQL). During this period, this "in-transit" data cannot be queried by monitor detections.

This can easily cause data to be missed when detecting over a fixed time range. For example:

Assume the detection time range is 5 minutes:

| Action | Time | Detection range |
| --- | --- | --- |
| Report data A (not yet written) | 00:09:59 | |
| Monitor triggers detection | 00:10:00 | 00:05:00 ~ 00:10:00 |
| Data A written to disk | 00:10:05 (data timestamp 00:09:59) | |
| Monitor triggers detection | 00:15:00 | 00:10:00 ~ 00:15:00 |

Although data A was reported before the first detection executed, it was missed in that detection because it had not yet been written to disk. Once written, its timestamp was already too early to fall within the time range of any subsequent detection, so it could never be detected.

Current Solution

To solve this problem, all monitors automatically drift the data query range 1 minute into the past during detection execution to avoid the data writing window.

After applying this solution, the above example becomes:

Assume the detection time range is 5 minutes:

| Action | Time | Detection range |
| --- | --- | --- |
| Report data A (not yet written) | 00:09:59 | |
| Monitor triggers detection | 00:10:00 | 00:04:00 ~ 00:09:00 (drifted 1 minute) |
| Data A written to disk | 00:10:05 (data timestamp 00:09:59) | |
| Monitor triggers detection | 00:15:00 | 00:09:00 ~ 00:14:00 (drifted 1 minute) |

Although data A was missed in the detection at 00:10:00, its timestamp 00:09:59 fell within the range of the detection at 00:15:00 (00:09:00 ~ 00:14:00), so it was successfully captured in the second detection.
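
The drifted window can be sketched as follows (illustrative Python assuming the fixed 1-minute drift and 5-minute detection range described above; `drifted_query_range` is a made-up name, not a platform API):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # configured detection time range
DRIFT  = timedelta(minutes=1)   # fixed drift to skip the data-writing window

def drifted_query_range(scheduled: datetime) -> tuple[datetime, datetime]:
    """Shift the whole query window DRIFT into the past, so data that was
    still being written during one run is picked up by a later run."""
    end = scheduled - DRIFT
    return end - WINDOW, end

start, end = drifted_query_range(datetime(2024, 1, 1, 0, 15))
print(start.time(), "~", end.time())   # 00:09:00 ~ 00:14:00
# Data A (timestamp 00:09:59) now falls inside the 00:15:00 run's window.
```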

Note

If the data writing delay exceeds 1 minute, this solution will fail, and the detection may not achieve the expected effect.

Data Gap Judgment Logic

Guance, as a time-series data platform, does not have the concept of an "asset master table" found in traditional asset management software. The system can only judge "what exists" based on the data it has queried, but cannot know about objects that "should exist but currently do not".

For example:

A box is known to contain a pencil and an eraser. We can clearly say "the box contains a pencil and an eraser", but we cannot assert that "the box does not contain a pen" because we do not know what the box "should" contain.

Therefore, the "data gap detection" in monitors actually uses an "edge-triggered" mechanism for judgment, i.e., it discovers changes by comparing the query results of two consecutive executions.

Its core logic is: "If object X was found to exist in the previous detection round, but disappears in this detection round, then X is judged to have a data gap."

Assume the detection time range is 5 minutes:

| 00:00:00 ~ 00:05:00 Result | 00:05:00 ~ 00:10:00 Result | Judgment Result |
| --- | --- | --- |
| Data detected | No data detected | Data gap |
| Data detected | Data detected | Continuous normal |
| No data detected | Data detected | Data re-reported |
| No data detected | No data detected | Continuous no data (meaningless state) |

Note

The above examples are only to illustrate the core logic. The actual judgment is also affected by configurations such as "Detection Range Drift", "Detection Time Range", and "Alert only after N consecutive minutes of no data".
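
A minimal sketch of this edge-triggered comparison, assuming each detection returns the set of object identifiers it found; the `judge_gaps` helper and the host names are hypothetical, not part of the platform:

```python
def judge_gaps(previous: set[str], current: set[str]) -> dict[str, str]:
    """Edge-triggered judgment: compare the object sets returned by two
    consecutive detections and classify each object that appears in either."""
    results = {}
    for obj in previous - current:
        results[obj] = "data gap"          # seen last round, missing this round
    for obj in previous & current:
        results[obj] = "continuous normal"
    for obj in current - previous:
        results[obj] = "data re-reported"  # missing last round, back this round
    return results

prev = {"host-a", "host-b"}   # objects found in 00:00:00 ~ 00:05:00
curr = {"host-b", "host-c"}   # objects found in 00:05:00 ~ 00:10:00
print(judge_gaps(prev, curr))
# e.g. {'host-a': 'data gap', 'host-b': 'continuous normal', 'host-c': 'data re-reported'}
```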

Data Gap / Data Recovery Events

When a monitor determines that a "data gap" or "data re-report" has occurred, it decides whether to generate a "data gap event" or a "data gap recovery event" based on user configuration.

To avoid generating duplicate or meaningless alerts, the system refers to the status of existing events before generating an event:

| Existing Event Status | Current Detection Result | System Action |
| --- | --- | --- |
| No event / Data gap recovery event | Data gap | Generate data gap event |
| No event / Data gap event | Data re-reported | Generate data gap recovery event |

Therefore, "data gap events" and "data gap recovery events" always alternate; there will not be consecutive data gap events or consecutive recovery events.

Frequently Asked Questions

The time indicated by the event does not match the time the event was generated

The time seen in event details or alert notifications (e.g., 00:15:00) is the monitor's scheduled trigger time (i.e., the regular time based on the Crontab expression), not the actual time the event was generated in the system.

Exception: If you manually click "Execute" in the monitor list, the event time generated will be the actual time you clicked execute.

The time indicated by the event does not match the actual time the fault occurred

The time indicated by the event is the monitor's scheduled trigger time. Furthermore, due to the "Detection Range Drift" mechanism, the actual data range being detected is from `Scheduled trigger time - Detection range - Drift time` to `Scheduled trigger time - Drift time`.

Therefore, the timestamp of the data point where the actual fault occurred may not fall within the intuitive interval from `Scheduled trigger time - Detection range` to `Scheduled trigger time`. This is normal behavior.
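
As an illustrative calculation, assuming a 5-minute detection range and the 1-minute drift described earlier (the timestamps are example values only):

```python
from datetime import datetime, timedelta

scheduled = datetime(2024, 1, 1, 0, 15)          # time shown on the event
window, drift = timedelta(minutes=5), timedelta(minutes=1)

start = scheduled - window - drift               # 00:09:00
end   = scheduled - drift                        # 00:14:00
print(f"event time {scheduled.time()}, data examined {start.time()} ~ {end.time()}")
# event time 00:15:00, data examined 00:09:00 ~ 00:14:00
```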

Suspected faulty data is found when querying directly in the platform, but the monitor did not generate an alert

This issue is usually caused by the following reasons:

  1. Excessively high data writing delay: The faulty data was not yet queryable due to delay when the detection executed;
  2. DQL query execution failure: The detection process was interrupted due to a query failure.

Such situations usually stem from issues in the data pipeline or query engine, which are beyond the control of the monitor itself.
