On-call


The on-call feature helps teams establish a 24/7 fault response mechanism. Every fault has a clearly responsible person, and a fault that is not handled within the timeout period is escalated automatically, so that alert delivery is guaranteed.

Core Concepts

On-call Rules

On-call rules define who is responsible for what type of faults at what time. Each rule includes the following elements:

  • On-call personnel: Members, teams, or notification targets.
  • Time period: The effective time range for the on-call duty (supports timezone settings).
  • Matching tags/dimensions: Determines which faults are routed to this rule.
  • Escalation policy: Notification escalation rules when not handled within the timeout.

Escalation Policy

An escalation policy is a multi-level notification mechanism attached to an on-call rule. When a fault is not claimed or resolved within a specified time, the system will gradually expand the notification scope according to preset levels, ensuring no fault is missed.

At Level 0 of the escalation policy you can select "Current On-call Person". When rotation is enabled for the on-call rule, this option resolves to the on-call personnel currently in effect.

Matching Tag Logic

Faults automatically match on-call rules based on their tags. Matching rules support:

  • AND: Multiple tags must all be satisfied (full match).
  • OR: Any one of multiple tags being satisfied is sufficient (partial match).
  • Wildcard: key:value* supports prefix matching.

Example:

Fault tags: {service:payment, env:prod, team:backend}

  • On-call Rule A: Tags service:payment AND env:prod → Matches ✓
  • On-call Rule B: Tag team:frontend → Does not match ✗
  • On-call Rule C: No tags (global) → Matches ✓ (fallback)

Note

If no matching tags are set, the on-call rule is considered "globally matched" and will receive all faults not matched by other rules.
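
The matching logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`condition_matches`, `rule_matches`, `route`) and the way rules are represented are hypothetical, not the product's API, and the fallback behaviour follows the note above (a global rule only receives faults that no tagged rule matched).

```python
def condition_matches(condition: str, fault_tags: dict[str, str]) -> bool:
    """Check one 'key:value' condition; a trailing '*' means prefix match."""
    key, _, pattern = condition.partition(":")
    value = fault_tags.get(key)
    if value is None:
        return False
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def rule_matches(conditions: list[str], fault_tags: dict[str, str],
                 logic: str = "AND") -> bool:
    """Evaluate all conditions of one rule with AND (default) or OR."""
    if not conditions:
        return True  # no tags configured: the rule is a global match
    results = [condition_matches(c, fault_tags) for c in conditions]
    return all(results) if logic == "AND" else any(results)


def route(rules: dict[str, list[str]], fault_tags: dict[str, str]) -> list[str]:
    """Return matching tagged rules, or the global rules as a fallback (assumed)."""
    specific = [name for name, conds in rules.items()
                if conds and rule_matches(conds, fault_tags)]
    if specific:
        return specific
    return [name for name, conds in rules.items() if not conds]


fault = {"service": "payment", "env": "prod", "team": "backend"}
rules = {
    "A": ["service:payment", "env:prod"],
    "B": ["team:frontend"],
    "C": [],  # global rule
}
print(route(rules, fault))
```

With the example fault, only Rule A is returned; a fault that matches neither A nor B would fall through to the global Rule C.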

On-call Calendar

The on-call calendar provides a visual scheduling view, making it easy to quickly understand current and future on-call arrangements.

  • Default view: Upon entering the on-call page, "My On-call" is displayed by default. The calendar on the right highlights all on-call schedules the current user participates in, and the list on the left shows all on-call rules that include the current user.

  • All On-call: After clicking "All On-call", the calendar on the right displays on-call schedules for all members, and the list on the left shows all on-call rules.

  • Default On-call: The built-in "Default On-call" is always displayed in the on-call list and cannot be deleted or hidden.

  • View Details: Click on the colored blocks or member names on the calendar to display detailed information about that on-call duty, including the associated on-call rule, escalation policy, and specific on-call time periods. The top left corner supports switching timezone and date to view historical or future schedules.

On-call Management

The "On-call Management" page centrally displays all on-call rules in a list format. Each rule lists key information such as on-call timezone, execution cycle, on-call personnel, matching tags, and escalation policy. The list includes both system default on-call and custom on-call rules. Clicking any entry takes you to its details page for in-depth configuration.

To ensure that fault notifications reach the right people and that responsibility for every fault is closed out, configuring an on-call strategy comes down to establishing the following two-layer guarantee mechanism:

  1. Clarify "Who is responsible when": By setting on-call personnel, effective time periods, and enabling notification rotation (supports automatic handover by day, week, etc.), the system achieves clear scheduling and automatic rotation of responsibilities, ensuring there is always a clear "first responder" at any time.

  2. Preset escalation path ("How to report if no response"): By configuring escalation policies, a "T+N minutes" progressive notification timeline is constructed. When a fault is not handled within the set time, the system will automatically escalate the alert notification to members at other levels or broader teams according to this rule, ensuring critical faults are guaranteed to be delivered.

Create an On-call Rule

Creating an on-call rule requires completing the following configuration steps.

Basic Information

  1. Enter the on-call name.
  2. Select the timezone the on-call is based on.
  3. Select the time period covered by this on-call duty. The effective start time and end time together define exactly when the on-call duty is valid.

Matching Tags/Dimensions (Optional)

This section determines which faults will be handled by this rule. If no tags/dimensions are added, this rule is globally matched.

  1. Matching Tags:

    • Select existing tags from the dropdown list.
    • Supports directly entering new tags for quick creation, or directly going to "Global Tags" for management.
  2. Matching Dimensions:

    • You can select detection dimensions (such as service, host) and set specific matching values.
    • Supports logical relationships: AND (full match, all conditions must be met) or OR (partial match, any one condition being met is sufficient), default is AND.
    • Values support wildcards, format is key:value*, e.g., service:auth* can match auth-api, auth-service, etc.

On-call Personnel Settings

  1. Select On-call Personnel: Can be one or more members, or an entire team.
  2. Enable Rotation: If rotation is needed, enable the rotation function. Set the rotation cycle (e.g., daily, weekly, monthly), and the system will automatically schedule rotations in the order of the member list, visually displaying the scheduling effect in the calendar on the right.
  3. Auto-claim: When enabled, if a fault matches only one on-call person, that member will automatically be assigned as the fault handler, and the fault status will be updated to "Working". Level 0 notifications of the escalation policy will still be sent normally.
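
As a rough illustration of how rotation in member-list order could work, the sketch below computes who is on duty at a given moment. The function name `current_on_call` and its parameters are hypothetical, and it assumes fixed-length cycles (daily or weekly); a monthly cycle has a variable length and would need calendar arithmetic instead.

```python
from datetime import datetime, timedelta


def current_on_call(members: list[str], start: datetime,
                    cycle: timedelta, now: datetime) -> str:
    """Return the member on duty at `now`, rotating in list order."""
    if now < start:
        raise ValueError("rotation has not started yet")
    shifts_elapsed = (now - start) // cycle  # whole cycles since the start
    return members[shifts_elapsed % len(members)]


members = ["A", "B", "C"]
start = datetime(2024, 1, 1, 0, 0)  # hypothetical effective start time
daily = timedelta(days=1)
print(current_on_call(members, start, daily, datetime(2024, 1, 2, 9, 0)))
```

With a daily cycle, A covers the first day, B the second, C the third, and the order then starts again with A.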

Rotation Example:

  • Before enabling rotation: [screenshot not available]

  • After enabling rotation: [screenshot not available]

Note

If the current rule does not configure any on-call personnel, you cannot add an escalation policy.

Configure Escalation Policy

The escalation policy ensures that when a fault is not handled within the timeout, the notification scope is automatically expanded to more people or higher levels. ❗️The escalation policy is the core of the on-call rule, and configuring it is strongly recommended.

Timeline Mechanism (T+N)

All time point calculations are based on the fault generation moment (denoted as T=0). The system triggers notifications at each level sequentially according to preset time intervals:

Trigger Time  | Level   | Description
T+0           | Level 0 | Immediate notification when the fault occurs (initial)
T+5 minutes   | Level 1 | First-level escalation
T+15 minutes  | Level 2 | Second-level escalation
T+30 minutes  | Level 3 | Third-level escalation

Level Configuration Description

1. Level 0 (Initial Notification) (Required)

  • Trigger timing: Immediate notification when fault occurs (T=0).
  • Notification targets: By default, the current on-call person (i.e., the on-call personnel currently in effect in the on-call rule) is filled in. Additional personnel or teams can also be added.
  • Notification method: Select the channels separately for each notification target (email, SMS, phone call; multiple selections allowed).

2. Level 1~10 (Escalation Levels) (Optional)

  • Trigger conditions: This level will only be triggered if all the following conditions are met:
    • The fault duration has reached the set wait time (e.g., T+20 minutes).
    • The fault severity is within the specified range (e.g., only effective for P0, P1).
    • The fault status is a specified value (e.g., Open or Working).
  • Notification targets: Only notify the personnel or teams configured in this level, not those configured in Level 0.
  • Notification method: Set the notification method individually for newly added personnel.

Note

The fault severity and status range of higher levels must not exceed the range already selected by lower levels. For example, if Level 0 applies to P0/P1, then Level 1 can only select a subset of P0 or P1 (cannot expand to P2).
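
The trigger conditions and the narrowing rule can be expressed as two small checks. The sketch below is a hedged illustration only: the `Level` class, its field names, and the two functions are assumptions made for this example, not the system's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    wait_minutes: int            # minutes after T=0 at which the level may fire
    severities: frozenset[str]   # e.g. {"P0", "P1"}
    statuses: frozenset[str]     # e.g. {"Open", "Working"}


def should_trigger(level: Level, elapsed_minutes: int,
                   severity: str, status: str) -> bool:
    """A level fires only when all three conditions hold."""
    return (elapsed_minutes >= level.wait_minutes
            and severity in level.severities
            and status in level.statuses)


def is_valid_narrowing(lower: Level, higher: Level) -> bool:
    """A higher level may only use a subset of the lower level's ranges."""
    return (higher.severities <= lower.severities
            and higher.statuses <= lower.statuses)


level0 = Level(0, frozenset({"P0", "P1"}), frozenset({"Open", "Working"}))
level1 = Level(20, frozenset({"P0"}), frozenset({"Open"}))
print(should_trigger(level1, 20, "P0", "Open"))
print(is_valid_narrowing(level0, level1))
```

Here Level 1 narrows Level 0 (P0 only, Open only), which is allowed; the reverse, expanding to a severity or status that the lower level does not cover, fails the check.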

Repeat Notification Mechanism

Within each level, you can choose whether to enable repeat notifications:

  • Disable repeat notification: This level only sends one notification, then waits to enter the next level.
  • Enable repeat notification: Send notifications periodically at the set frequency (e.g., every 5 minutes) until the fault status changes or enters the next level.

Note

The repeat interval must be less than the wait time to enter the next level, otherwise it cannot be set.

Example:

  • Level 1 wait time: 30 minutes
  • Repeat interval: 5 minutes
  • Final effect: One notification is sent at each of T+5, T+10, T+15, T+20, T+25, and T+30 minutes.

Note

If the last level (e.g., Level 10) has repeat notifications enabled and the fault is never handled, the system will repeat notifications indefinitely until someone claims or resolves it.
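
The repeat schedule can be worked out with simple arithmetic. The sketch below reproduces the example above under two assumptions that the text does not spell out: the repeats start counting from the moment the waiting period begins (T+0 here), and the notification at the boundary (T+30) is included in the list. The function name `repeat_times` is hypothetical.

```python
def repeat_times(level_start: int, interval: int, next_level_at: int) -> list[int]:
    """List the minutes (relative to T=0) at which notifications are sent."""
    if interval <= 0:
        raise ValueError("repeat interval must be positive")
    if interval >= next_level_at - level_start:
        # Mirrors the rule that the interval must be shorter than the wait time.
        raise ValueError("repeat interval must be less than the wait time")
    return list(range(level_start + interval, next_level_at + 1, interval))


print(repeat_times(level_start=0, interval=5, next_level_at=30))
```

For the last level there is no next level to bound the schedule, so the same arithmetic would simply continue until the fault is claimed or resolved.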

Handling Cross On-call Handover

If the fault duration spans an on-call handover time, subsequent escalation notifications will be transferred to the new on-call person and executed according to the new on-call person's escalation policy.

Example:

  • A fault occurs at 23:55, and the on-call person at that time is A.
  • The wait time for Level 1 in the escalation policy is 15 minutes, configured to repeat every 5 minutes.
  • The first repeat notification triggers 5 minutes after the fault occurs (i.e., 0:00). At this time, the on-call person has switched to B, so this notification will be sent to B, and all subsequent escalation notifications (including remaining repeats and the next level) will be executed according to B's escalation policy.

Once the handover has taken place (here, at the day boundary), the system continues processing the fault according to the escalation rules of the new on-call person B.

Note

It is recommended to consider cross-day scenarios when configuring escalation policies to ensure faults can be effectively responded to at any time period.
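
The handover example can be checked with a short calculation. This is a minimal sketch, assuming a single handover at midnight; the dates and the helper `recipient_at` are hypothetical and only illustrate that the recipient is chosen by the time of the notification, not the time of the fault.

```python
from datetime import datetime, timedelta


def recipient_at(notify_time: datetime, handover: datetime,
                 outgoing: str, incoming: str) -> str:
    """Pick whoever is on call at the moment the notification is sent."""
    return incoming if notify_time >= handover else outgoing


fault_time = datetime(2024, 1, 1, 23, 55)   # fault occurs while A is on call
handover = datetime(2024, 1, 2, 0, 0)       # B takes over at midnight
first_repeat = fault_time + timedelta(minutes=5)

print(recipient_at(fault_time, handover, "A", "B"))    # initial notification
print(recipient_at(first_repeat, handover, "A", "B"))  # first repeat
```

The initial notification goes to A, while the first repeat, which lands exactly on the handover time, goes to B.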

Multi-escalation Policy Deduplication

When the same fault matches multiple on-call rules (thus matching multiple escalation policies), the system automatically performs notification deduplication to ensure the same user does not receive duplicate notifications. Deduplication logic is based on user, fault, and notification content.
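
One way to picture the deduplication is a set keyed on the three fields named above. The sketch below is illustrative only; the tuple layout `(user, fault_id, content)` and the function name `deduplicate` are assumptions, and the real system may compare content differently.

```python
def deduplicate(notifications: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Drop repeats of the same (user, fault_id, content), keeping first-seen order."""
    seen: set[tuple[str, str, str]] = set()
    unique = []
    for notification in notifications:
        if notification not in seen:
            seen.add(notification)
            unique.append(notification)
    return unique


pending = [
    ("alice", "fault-1", "P0 payment outage"),  # from escalation policy 1
    ("alice", "fault-1", "P0 payment outage"),  # same user via policy 2
    ("bob", "fault-1", "P0 payment outage"),
]
print(deduplicate(pending))
```

Alice matches two policies but receives a single notification, while Bob still receives his own.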

Escalation Policy Configuration Example

Scenario: Escalation Policy for Core Service P0 Fault

Level   | Wait Time    | Applicable Conditions                | Notification Targets     | Notification Method
Level 0 | T+0          | Severity = P0                        | Current on-call person A | SMS + Email
Level 1 | T+5 minutes  | Severity = P0, Status = Open/Working | + On-call Team Lead B    | B: Phone Call
Level 2 | T+15 minutes | Severity = P0, Status = Open/Working | + Department Manager C   | C: Phone Call
Level 3 | T+30 minutes | Severity = P0, Status = Open/Working | + CTO D                  | D: Phone Call + SMS

In this example:

  • When the fault occurs, immediately notify the current on-call person A.
  • If the fault is not handled after 5 minutes, additionally notify the on-call team lead B (notification targets now are A + B).
  • If still not handled after 15 minutes, additionally notify the department manager C (notification targets now are A + B + C).
  • If still not handled after 30 minutes, additionally notify CTO D, and Level 3 has repeat notifications enabled (e.g., every 10 minutes) until someone responds.
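
The timeline of this example can also be written down as data. The sketch below is a simplified illustration, assuming the fault stays at P0 and is never claimed; the `POLICY` list and the function `notified_so_far` are hypothetical names used only for this example.

```python
# (wait time in minutes after T=0, notification target) for each level
POLICY = [
    (0, "A"),    # Level 0: current on-call person
    (5, "B"),    # Level 1: on-call team lead
    (15, "C"),   # Level 2: department manager
    (30, "D"),   # Level 3: CTO
]


def notified_so_far(elapsed_minutes: int) -> list[str]:
    """Everyone who has been notified once `elapsed_minutes` have passed."""
    return [target for wait, target in POLICY if elapsed_minutes >= wait]


for minute in (0, 5, 15, 30):
    print(minute, notified_so_far(minute))
```

Each level adds one more person to the set of people who have been notified, which matches the A, A + B, A + B + C progression described above.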

Notification Method Description

Prerequisites
  1. The notified personnel must have configured corresponding contact information (email, phone number) in their Preferences, otherwise they cannot receive notifications via those channels.
  2. If "On-call Phone" or "On-call Email" is additionally configured in "Preferences", the system will prioritize using these dedicated contact methods for notifications to improve reliability and distinction.
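
The contact priority described in these prerequisites amounts to a simple fallback. The sketch below assumes a user's Preferences can be read as a dictionary; the keys `on_call_phone`, `on_call_email`, `phone`, and `email` are hypothetical field names chosen for illustration.

```python
from typing import Optional


def pick_contact(preferences: dict[str, str], channel: str) -> Optional[str]:
    """Prefer the dedicated on-call contact, then the general one, else None.

    `channel` is "phone" or "email" (hypothetical keys).
    """
    return preferences.get(f"on_call_{channel}") or preferences.get(channel)


prefs = {
    "email": "alice@example.com",
    "phone": "+1 555 0100",
    "on_call_phone": "+1 555 0199",
}
print(pick_contact(prefs, "phone"))
print(pick_contact(prefs, "email"))
```

A result of `None` corresponds to the case where the person has no contact information for that channel and therefore cannot be notified through it.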

The system supports three notification channels. You can check individually for each notification target:

Method | Description | Applicable Scenarios
Email  | Sends an email notification containing fault details and links. | Non-urgent faults, scenarios requiring detailed information.
SMS    | Sends an SMS notification with concise content, only key information and links. | Scenarios requiring timely awareness but not immediate phone response.
Phone  | IVR voice call. After connection, the alert content can be played, and keypress confirmation is required. | Urgent faults, ensuring guaranteed information delivery, suitable for nighttime or high priority.

❗️If you need to configure on-call phone numbers for contacts in different timezones/regions, be sure to use the +area code format.

Default On-call

The system has a built-in "Default On-call", which is a simplified version of an on-call rule suitable for simple scenarios. Its characteristics are as follows:

  • Only on-call personnel, personnel rotation, and escalation policy can be configured.
  • Non-configurable items: timezone (fixed as empty, follows system timezone), matching tags/dimensions (not supported for setting, defaults to global match).
  • Default On-call will always be displayed in the on-call list and cannot be deleted.

Rule Limitations

  1. An on-call rule can have up to 10 escalation levels (Level 1~10) in addition to the required Level 0.
  2. The maximum wait time for a single level is 360 minutes (6 hours); a rule with a longer wait time cannot be saved.
  3. The fault severity and status range of higher levels must be a subset of the range already selected by lower levels.
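
The first two limits lend themselves to a quick pre-save check. The sketch below is an assumption-laden illustration: the constants restate the limits above, while the function `validate_waits` and its input (one wait time per escalation level, Level 1 first) are hypothetical.

```python
MAX_ESCALATION_LEVELS = 10   # Level 1~10, not counting Level 0
MAX_WAIT_MINUTES = 360       # 6 hours


def validate_waits(escalation_waits: list[int]) -> list[str]:
    """Return a list of problems; an empty list means the limits are respected."""
    errors = []
    if len(escalation_waits) > MAX_ESCALATION_LEVELS:
        errors.append(f"too many escalation levels: {len(escalation_waits)}")
    for number, wait in enumerate(escalation_waits, start=1):
        if wait > MAX_WAIT_MINUTES:
            errors.append(f"Level {number}: wait time {wait} exceeds {MAX_WAIT_MINUTES} minutes")
    return errors


print(validate_waits([5, 15, 30]))
print(validate_waits([5, 400]))
```

The first call returns no problems; the second reports that Level 2 exceeds the 360-minute limit.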

Configuration Checklist

Before saving the on-call rule, it is recommended to confirm item by item:

  • Does Level 0 include the current on-call person (included by default)?
  • Is the wait time for each level reasonable? (Consider that nighttime response may require longer times.)
  • Does the final level include contacts that "must be reached no matter what"?
  • If repeat notifications are enabled, is the repeat interval less than the wait time for the next level?
  • Have all notification targets configured corresponding contact methods (especially phone)?
  • Under cross-day scenarios, does the continuity of the escalation policy meet requirements?

Next Steps

After configuring the on-call rules, you can see the fault's automatically associated on-call information in the Incident List. When a fault occurs, the system will automatically notify the corresponding personnel according to your set rules and execute the escalation policy after timeout, ensuring every fault receives a timely response.
