Studio Self-Monitoring Configuration and Metrics Explanation¶

This document explains how to confirm whether self-monitoring configuration is enabled for the Deployment Plan Studio, and details the metrics, tags, units, and monitoring recommendations related to APIs, Celery asynchronous tasks, Redis/Broker, business tasks, and the export pipeline within the self-monitoring Measurement df_studio.

Applicable Versions¶

The active self-monitoring metrics capability is available starting with the release dated May 20, 2026.
The release dated May 13, 2026 does not yet support this capability.
It has been confirmed in the Lark issue ticket that the latest Deployment Plan v1.130.225 supports this capability. This version corresponds to the current Studio system commit 60a71d992. The metrics and configurations in this document have been verified against this commit.
If the environment runs a version below v1.130.225, upgrade first and then apply the configuration.

Collection Pipeline¶

The Studio application side does not actively push metrics to external services. The recommended pipeline is:

Studio API / Celery / WebSocket / Snapshot
  -> Lightweight metric recording within the application
  -> Redis metric cache
  -> inner /metrics Prometheus text endpoint
  -> Datakit periodic pull
  -> Self-monitoring Workspace
  -> Dashboard / Monitor / Alert

Datakit pull address:

http://<inner-service-ip>:5000/api/v1/inner/metrics?from=datakit&type=df_studio

The Prometheus text endpoint outputs the full metric name, e.g., df_studio_celery_task_published_total. In the Guance UI or DQL, queries are typically performed by "Measurement + field", i.e., the Measurement is df_studio, and the field is celery_task_published_total.
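The name mapping described above can be sketched as follows. This is a minimal illustration that assumes the Measurement name and the field are joined by a single underscore, as in the example name; the helper `split_metric_name` is hypothetical and is not part of Studio or Datakit.

```python
MEASUREMENT = "df_studio"


def split_metric_name(full_name: str, measurement: str = MEASUREMENT) -> tuple[str, str]:
    """Split a full Prometheus metric name into (measurement, field).

    Assumption: the full name is "<measurement>_<field>".
    """
    prefix = measurement + "_"
    if not full_name.startswith(prefix) or len(full_name) == len(prefix):
        raise ValueError(f"{full_name!r} does not belong to measurement {measurement!r}")
    return measurement, full_name[len(prefix):]


if __name__ == "__main__":
    print(split_metric_name("df_studio_celery_task_published_total"))
```

The resulting field name is what a DQL query would reference inside the `df_studio` Measurement.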

How to Check if Self-Monitoring Configuration is Enabled¶

1. Check Studio Backend Configuration¶

The Studio backend configuration item is SelfMonitorMetricsSet. It is disabled by default. To turn it on, explicitly set enable to true:

SelfMonitorMetricsSet:
  enable: true

Other configurations can remain at their defaults. Their meanings are as follows:

Configuration Item	Default Value	Unit	Description
`enable`	`false`	Boolean	Unified switch for self-monitoring. Only when set to `true` will metrics related to APIs, Celery, business tasks, and `/metrics` export be recorded.
`expireSeconds`	`3600`	seconds	Retention window for periodic incremental metrics in Redis.
`stateExpireSeconds`	`604800`	seconds	Retention window for stateful metrics like beat last publish time, business task last success/failure.
`beatMissedLagThresholdSeconds`	`300`	seconds	Default lag threshold for determining if a beat execution hasn't started after publishing.
`beatMissedIntervalMultiplier`	`2`	multiplier	Multiplier of the recent publish interval used to determine if a low-frequency beat missed scheduling.
`celeryQueues`	`celery`, `correlation_task`, `snapshot_queue`, `compute_task`	list	Celery queues for which queue length and oldest wait time need to be read.

It can also be overridden via environment variables:

STUDIO__SelfMonitorMetricsSet__enable=true

Note: enable must be a boolean value, either true or false. An invalid string or null causes configuration loading to fail.
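The strict boolean rule can be illustrated with a small validator. This is only a sketch: the function `parse_enable` is hypothetical, and the assumption that the accepted spellings are `true` and `false` (case-insensitive) is made for illustration. The actual Studio loader may accept a different set of values.

```python
from typing import Optional


def parse_enable(raw: Optional[str]) -> bool:
    """Validate an `enable` override strictly.

    Assumption (not confirmed by the source): only "true" and "false",
    ignoring case and surrounding whitespace, are accepted.
    """
    if raw is None:
        raise ValueError("enable must not be null")
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"enable must be true or false, got {raw!r}")
```

Failing fast on an invalid value mirrors the documented behavior, where a bad `enable` value stops configuration loading instead of silently falling back to a default.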

2. Check if `/metrics` Outputs Self-Monitoring Metrics¶

Access the inner service within the cluster:

curl 'http://management-backend.forethought-core:5000/api/v1/inner/metrics?from=datakit&type=df_studio'

If enabled and exporting normally, the response should contain content similar to:

df_studio_self_monitor_export_total{exporter="prometheus_inner",result="success"} 1
df_studio_self_monitor_export_duration_seconds{exporter="prometheus_inner",result="success"} ...
df_studio_self_monitor_export_last_success_timestamp_seconds{exporter="prometheus_inner"} ...

If an exception occurs during export, the endpoint fails open and still attempts to return failure metrics:

df_studio_self_monitor_export_total{exporter="prometheus_inner",result="failure"} 1
df_studio_self_monitor_export_error_total{exception_type="...",exporter="prometheus_inner"} 1
df_studio_self_monitor_export_last_failure_timestamp_seconds{exception_type="...",exporter="prometheus_inner"} ...
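A response can be classified automatically by looking for the `self_monitor_export_total` samples shown above. The sketch below uses a simplified parser for the Prometheus text format that is sufficient for lines of this shape. The function names and the three status strings are hypothetical, and treating absent samples as "missing" (for example because the switch is off) is an assumption.

```python
import re

# Simplified patterns for `name{label="value",...} value` lines.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) for parseable sample lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue  # skip lines whose value is not numeric
        samples.append((name, dict(LABEL_RE.findall(raw_labels or "")), value))
    return samples


def export_status(text, exporter="prometheus_inner"):
    """Classify a /metrics response as "success", "failure", or "missing"."""
    results = set()
    for name, labels, value in parse_samples(text):
        if (name == "df_studio_self_monitor_export_total"
                and labels.get("exporter") == exporter and value > 0):
            results.add(labels.get("result"))
    if "failure" in results:
        return "failure"
    if "success" in results:
        return "success"
    return "missing"
```

The text passed to `export_status` would be the body returned by the `curl` command above.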

3. Check Historical Health Check Interface¶

The management backend still retains the Celery worker health check interface:

curl 'http://management-backend.forethought-core:5000/api/v1/const/celery/ping'

This interface reads celery_active_point from Redis and returns the last active time for each queue. A 200 response indicates there is an active point within the configured valid offset time. A 400 response usually indicates that the corresponding worker hasn't updated its active point for a long time, which may be due to the worker not running, task backlog, Redis/Broker connection issues, etc.

This interface is suitable as a backward-compatible health check. For complete self-monitoring, prefer the df_studio metrics described below.

Metrics and Tag Conventions¶

Global Tags¶

Tag	Applicable Scope	Meaning	Common Values	Usage Suggestions
`service`	API	Service entry name	`front`, `inner`, `openapi`, `admin`, `external`, `center`, `aiapi`, `sse`	Low cardinality, suitable for overviews.
`run_app_code`	API	Current process run entry	Same as `service`	Low cardinality, useful for distinguishing entries.
`route_rule`	API	Flask route rule	`/api/v1/...`	More suitable for aggregation than raw URLs.
`method`	API	HTTP method	`GET`, `POST`, `PUT`	Low cardinality.
`status_class`	API	HTTP status code family	`2xx`, `4xx`, `5xx`	Used for success rate, error rate.
`queue`	Celery	Celery queue name	`celery`, `correlation_task`, `snapshot_queue`, `compute_task`	Low cardinality, core dimension for asynchronous task overview.
`task`	Celery / Business tasks	Celery task name or business task name	`forethought.tasks...`, `statistics_upload`	Medium cardinality, used for task-level troubleshooting.
`status`	Celery	Task end status	`success`, `failure`, `retry`	Used for task quality analysis.
`exception_type`	Celery / Export pipeline	Exception type	`TimeoutError`, `OperationalError`	Used for exception TopN.
`beat_name`	Celery beat	Beat entry name	Beat entry name from configuration	Used to determine if scheduled tasks missed scheduling.
`domain`	Business tasks	Business domain	`archive_report`, `incidents`, `billing`, `cleanup`	Low cardinality, primary dimension for business task overview.
`result`	Business tasks / Export pipeline	Execution result	`success`, `error`, `failure`, `partial_success`, `skipped`	Used for success and failure rates.
`item_type`	Business tasks	Processed object type	`workspace`, `report_task`, `notification`	Low cardinality.
`reason`	Business tasks	Partial failure reason	`notify_failed`, `item_error`	Can be used for alerts after controlled enumeration.
`entry`	Independent entry	Non-Flask entry	`websocket`, `snapshot`	Used for independent entry health.
`event`	Independent entry	Entry event	`connect`, `disconnect`, `send_task`	Used for entry event analysis.
`state`	Current state metrics	State name	`size`, `checked_out`, `overflow`	Specific meaning depends on the metric.
`exporter`	`/metrics` export	Exporter name	`prometheus_inner`	Low cardinality.
`le`	Histogram bucket	Bucket upper bound	`0.1`, `1`, `5`, `+Inf`	Used only for `_bucket` metrics to calculate percentiles.

le represents the less-than-or-equal-to upper bound of a histogram bucket, not a business dimension. For example, le="1" indicates the cumulative count of samples less than or equal to 1 second, le="+Inf" indicates the total number of all samples.
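Because `le` buckets are cumulative, a percentile can be estimated by locating the bucket that contains the target rank and interpolating linearly inside it. The sketch below follows that general approach, which is similar in spirit to the quantile estimation commonly used for Prometheus histograms. The function name and input shape are hypothetical, and it assumes durations are non-negative so the lowest bucket starts at 0.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (le_upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no samples observed
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if bound == math.inf:
                return prev_bound  # cannot interpolate into the unbounded bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    sample = [(0.1, 50), (1.0, 90), (5.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.5, sample), histogram_quantile(0.95, sample))
```

The estimate is only as precise as the bucket boundaries, so a P99 that falls in a wide bucket should be read as an approximation.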

API Metrics¶

Metric Field	Unit	Tags	Meaning
`api_request_count`	count	`service`, `api_path`	Legacy metric kept for compatibility: number of non-5xx API requests.
`api_request_error_count`	count	`service`, `api_path`	Legacy metric kept for compatibility: number of 5xx API requests.
`api_requests_total`	count	`service`, `run_app_code`, `route_rule`, `method`, `status_class`	Total API request count, periodic increment.
`api_errors_total`	count	`service`, `run_app_code`, `route_rule`, `method`, `status_class`, `error_type`	API error count, currently mainly covering HTTP 5xx.
`api_duration_seconds_bucket`	seconds	`service`, `run_app_code`, `route_rule`, `method`, `status_class`, `le`	API request duration distribution.
`api_duration_seconds_sum`	seconds	`service`, `run_app_code`, `route_rule`, `method`, `status_class`	Sum of API request durations.
`api_duration_seconds_count`	count	`service`, `run_app_code`, `route_rule`, `method`, `status_class`	Number of API request duration samples.

Celery Queue and Task Metrics¶

As of commit 60a71d992, the following metrics are recorded via Celery signals and exported under the df_studio Measurement. worker_queue_count and celery_queue_oldest_wait_seconds are read directly from the Redis broker queue and are used to detect Redis/Broker queue backlog or workers that are not consuming. The Celery task lifecycle metrics further distinguish between "not started consuming" and "stuck after starting".

Metric Field	Unit	Tags	Meaning
`worker_queue_count`	count	`queue`	Current length of the Redis broker queue.
`celery_queue_oldest_wait_seconds`	seconds	`queue`	Wait time from publish to current for the oldest task in the queue.
`celery_task_published_total`	count	`task`, `queue`	Number of Celery task publications.
`celery_task_started_total`	count	`task`, `queue`	Number of Celery task execution starts.
`celery_task_finished_total`	count	`task`, `queue`, `status`	Number of Celery task completions, distinguished by status.
`celery_task_active`	count	`task`, `queue`	Number of Celery tasks currently executing.
`celery_task_duration_seconds_bucket`	seconds	`task`, `queue`, `le`	Task execution duration distribution.
`celery_task_duration_seconds_sum`	seconds	`task`, `queue`	Sum of task execution durations.
`celery_task_duration_seconds_count`	count	`task`, `queue`	Number of task execution duration samples.
`celery_task_queue_wait_seconds_bucket`	seconds	`task`, `queue`, `le`	Distribution of queue wait time from task publish to execution start.
`celery_task_queue_wait_seconds_sum`	seconds	`task`, `queue`	Sum of task queue wait times.
`celery_task_queue_wait_seconds_count`	count	`task`, `queue`	Number of task queue wait time samples.
`celery_task_failure_exception_total`	count	`task`, `queue`, `exception_type`	Distribution of task failure exception types.
`celery_task_timeout_total`	count	`task`, `queue`, `timeout_type`	Number of Celery soft/hard timeout occurrences.
`celery_task_retry_total`	count	`task`, `queue`, `exception_type`	Number of task retries.
`celery_task_retry_delay_seconds_bucket`	seconds	`task`, `queue`, `le`	Task retry delay distribution.
`celery_task_retry_delay_seconds_sum`	seconds	`task`, `queue`	Sum of task retry delays.
`celery_task_retry_delay_seconds_count`	count	`task`, `queue`	Number of task retry delay samples.

Beat and Scheduled Task Metrics¶

Metric Field	Unit	Tags	Meaning
`celery_beat_task_last_publish_timestamp_seconds`	Unix seconds	`beat_name`, `task`	Last publish time of the beat entry task.
`celery_beat_task_last_started_timestamp_seconds`	Unix seconds	`beat_name`, `task`	Last execution start time of the beat entry's corresponding task.
`celery_beat_lag_seconds`	seconds	`beat_name`, `task`	Lag from beat task publish to worker execution start.
`celery_beat_publish_interval_seconds`	seconds	`beat_name`, `task`	Actual interval between the last two publishes of the beat entry.
`celery_beat_missed`	boolean	`beat_name`, `task`	Whether scheduling is suspected to be missed, `1` indicates suspected missed scheduling.
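The document does not spell out exactly how `celery_beat_missed` is computed. The sketch below shows one plausible reading of how the `beatMissedLagThresholdSeconds` and `beatMissedIntervalMultiplier` settings could combine with the beat timestamps. The function, its parameters, and the way the two conditions are combined are all assumptions for illustration, not the Studio implementation.

```python
from typing import Optional


def beat_suspected_missed(
    now: float,
    last_publish: float,
    last_started: Optional[float],
    publish_interval: Optional[float],
    lag_threshold: float = 300.0,      # mirrors the beatMissedLagThresholdSeconds default
    interval_multiplier: float = 2.0,  # mirrors the beatMissedIntervalMultiplier default
) -> bool:
    """Return True when a beat entry looks like it missed scheduling.

    All timestamps are Unix seconds; publish_interval is in seconds.
    """
    # Condition A (assumed): the task was published, but no execution has
    # started since then and the lag threshold has been exceeded.
    not_started = last_started is None or last_started < last_publish
    lag_exceeded = not_started and (now - last_publish) > lag_threshold

    # Condition B (assumed): no new publish has arrived within a multiple
    # of the most recent publish interval.
    publish_overdue = (
        publish_interval is not None
        and publish_interval > 0
        and (now - last_publish) > interval_multiplier * publish_interval
    )
    return lag_exceeded or publish_overdue
```

Under this reading, a high-frequency beat is caught mainly by the interval check, while a low-frequency beat that published but never started is caught by the lag check.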

Business Task Metrics¶

Metric Field	Unit	Tags	Meaning
`business_task_runs_total`	count	`domain`, `task`, `result`	Number of business task runs.
`business_task_items_total`	count	`domain`, `task`, `item_type`, `result`	Number of objects processed by business tasks.
`business_task_duration_seconds_bucket`	seconds	`domain`, `task`, `result`, `le`	End-to-end duration distribution of business tasks.
`business_task_duration_seconds_sum`	seconds	`domain`, `task`, `result`	Sum of end-to-end durations of business tasks.
`business_task_duration_seconds_count`	count	`domain`, `task`, `result`	Number of end-to-end duration samples for business tasks.
`business_task_last_success_timestamp_seconds`	Unix seconds	`domain`, `task`	Last successful time of the business task.
`business_task_last_failure_timestamp_seconds`	Unix seconds	`domain`, `task`, `exception_type`	Last failure time of the business task.
`business_task_partial_failure_total`	count	`domain`, `task`, `reason`	Number of times a task did not fail overall but had partial failures.

Currently integrated business domains include:

`domain`	Typical Tasks	Focus Points
`archive_report`	Archive report v2/v3, first-cycle notification, delayed notification	Whether report triggering, screenshots, notifications are successful, presence of partial failures.
`incidents`	Incident duty policy analysis, incident queue sync, incident notification sending	Whether the incident notification pipeline is successful, presence of backlog.
`billing`	Billing statistics reporting	Whether on time, successful, number of workspaces processed.
`workspace_usage`	OpenAPI API Key usage database refresh	Whether usage refresh is successful, number of buckets and access keys processed.
`cleanup`	Dashboard history cleanup, etc.	Whether cleanup tasks are failing long-term or skipped.
`sync_config`	Integration template synchronization	Whether configuration synchronization is successful.
`notification`	Status Page status change notification	Whether notification tasks succeed or fail.
`keyevent`	Critical event unresolved asynchronous query	Whether critical event asynchronous queries are abnormal.
`cloud_collector`	Cloud collector asynchronous operations	Asynchronous operation splitting, lock waiting, success/failure.
`catalog`	Unified catalog entity health	Whether entity health tasks are on time, successful, and processing volume is abnormal.
`snapshot`	Dashboard screenshot, chart screenshot, chart data generation	Snapshot service screenshot/chart data task results.

Independent Entry and Dependency Health Metrics¶

Metric Field	Unit	Tags	Meaning
`service_entry_events_total`	count	`entry`, `event`, `result`	Number of events for non-Flask entries like WebSocket, snapshot.
`service_entry_active`	count/boolean	`entry`, `state`	Current active state of non-Flask entries.
`dependency_db_pool_connections`	count	`pool`, `state`	Current state of the database connection pool in the exporter's process, `state` includes `size`, `checked_in`, `checked_out`, `overflow`.
`self_monitor_export_total`	count	`exporter`, `result`	Result of this `/metrics` export.
`self_monitor_export_points_total`	count	`exporter`, `result`	Number of Prometheus samples successfully exported in this `/metrics` export.
`self_monitor_export_duration_seconds`	seconds	`exporter`, `result`	Duration of this `/metrics` export.
`self_monitor_export_last_success_timestamp_seconds`	Unix seconds	`exporter`	Last successful export time.
`self_monitor_export_last_failure_timestamp_seconds`	Unix seconds	`exporter`, `exception_type`	Last fail-open failure export time.
`self_monitor_export_error_total`	count	`exporter`, `exception_type`	Fail-open failure event recorded for this `/metrics` export.

Asynchronous Task and Redis/Broker Monitoring Recommendations¶

Customer concerns like "are asynchronous tasks abnormal, is Redis disconnected, are workers stuck" cannot be judged by a single metric. It is recommended to use combined conditions.

Scenario	Key Metrics to Watch	Recommended Dimensions	How to Interpret
Worker not consuming or insufficient consumption capacity	`worker_queue_count`, `celery_queue_oldest_wait_seconds`, `celery_task_published_total`, `celery_task_started_total`	`queue`, `task`	If queue length and oldest wait time keep rising while published increases and started stays very low, the worker is usually not consuming, lacks capacity, or has broker connection issues.
Redis/Broker readable but worker disconnected	`worker_queue_count`, `celery_queue_oldest_wait_seconds`, `celery_task_active`	`queue`	If the exporter can read the queue and the backlog is rising, but active stays at 0 or remains significantly low for a long time, suspect a worker-side disconnection, a hang, or a worker that was not started.
Redis/Broker completely unavailable or exporter read failure	`self_monitor_export_total`, `self_monitor_export_error_total`, `self_monitor_export_last_failure_timestamp_seconds`, `self_monitor_export_points_total`	`exporter`, `exception_type`	If `/metrics` fails open, the failure time keeps refreshing, and the sample count drops significantly, the collection pipeline itself may be failing to access Redis, the DB, or the metric source.
Task starts but gets stuck and doesn't finish	`celery_task_active`, `celery_task_started_total`, `celery_task_finished_total`, `celery_task_duration_seconds_bucket`	`queue`, `task`	If active does not decrease for a long time, started increases while finished does not, or P99 duration keeps rising, tasks may be stuck on external calls, locks, the DB, or loop logic.
Task failure or retry storm	`celery_task_finished_total`, `celery_task_failure_exception_total`, `celery_task_retry_total`, `celery_task_retry_delay_seconds_bucket`	`task`, `exception_type`	If failures and retries rise together and the exception types are concentrated, tasks may have entered a failure-retry loop.
Beat publishes normally but worker doesn't start	`celery_beat_task_last_publish_timestamp_seconds`, `celery_beat_task_last_started_timestamp_seconds`, `celery_beat_lag_seconds`, `celery_beat_missed`	`beat_name`, `task`	If last_publish updates but last_started does not, and lag rises or missed=1, the scheduled task was delivered but the worker has not started consuming it.
Beat stops publishing or low-frequency task misses scheduling	`celery_beat_publish_interval_seconds`, `celery_beat_task_last_publish_timestamp_seconds`, `celery_beat_missed`	`beat_name`, `task`	If the publish interval exceeds the historical period or last_publish is too old, beat may have stopped, the configuration may not be enabled, or the scheduler may be abnormal.
Business task overall success but partial object failure	`business_task_partial_failure_total`, `business_task_items_total`, `business_task_runs_total`	`domain`, `task`, `reason`, `item_type`	If partial failures increase while the overall task result may still be `partial_success`, check the failure reasons of the specific business objects.
Business task no success for a long time	`business_task_last_success_timestamp_seconds`, `business_task_last_failure_timestamp_seconds`, `business_task_runs_total`	`domain`, `task`	If last_success is far behind the current time while last_failure keeps updating or runs show no success, this business pipeline may be failing silently.
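Two of the combined conditions in the table above can be sketched in code. The example compares counter increases over one observation window for a single queue. The `QueueWindow` structure, the finding labels, and the 300-second wait limit are hypothetical choices for illustration. Real thresholds depend on the business-acceptable wait time.

```python
from dataclasses import dataclass


@dataclass
class QueueWindow:
    """Values for one queue over one observation window (hypothetical shape)."""
    queue_length_start: float   # worker_queue_count at window start
    queue_length_end: float     # worker_queue_count at window end
    oldest_wait_seconds: float  # celery_queue_oldest_wait_seconds at window end
    published_delta: float      # increase of celery_task_published_total
    started_delta: float        # increase of celery_task_started_total
    finished_delta: float       # increase of celery_task_finished_total
    active: float               # celery_task_active at window end


def diagnose(window: QueueWindow, max_wait_seconds: float = 300.0) -> list[str]:
    """Return suspected scenarios; an empty list means nothing was flagged."""
    findings = []
    backlog_growing = window.queue_length_end > window.queue_length_start
    waiting_too_long = window.oldest_wait_seconds > max_wait_seconds

    # Tasks are published and piling up, but none has started executing.
    if (backlog_growing or waiting_too_long) and window.published_delta > 0 \
            and window.started_delta == 0:
        findings.append("worker_not_consuming")

    # Tasks are running, but none finished during the window.
    if window.active > 0 and window.finished_delta == 0:
        findings.append("task_stuck_after_start")
    return findings
```

A single window can produce false positives for long-running tasks, so a condition like this is more useful when it holds across several consecutive windows.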

It is recommended to at least set up the following alerts:

Alert Item	Suggested Severity	Suggested Condition
Self-monitoring export failure	P0	`self_monitor_export_total{result="failure"}` appears or `self_monitor_export_error_total` appears.
Self-monitoring no success for a long time	P0	Current time minus `self_monitor_export_last_success_timestamp_seconds` exceeds 2 to 3 Datakit pull cycles.
Celery queue backlog	P0	`worker_queue_count` continuously exceeds threshold, or `celery_queue_oldest_wait_seconds` continuously exceeds business-acceptable wait time.
Worker suspected not consuming	P0	`celery_task_published_total` increases, but `celery_task_started_total` shows no growth for a long time, while queue length or oldest wait time rises.
Worker suspected stuck	P0	`celery_task_active` > 0 for a long time and doesn't decrease, `celery_task_finished_total` doesn't grow, task duration P99 continuously rises.
Beat missed scheduling	P0	`celery_beat_missed=1`, or `celery_beat_lag_seconds` exceeds task acceptable threshold.
Celery task failure rate increase	P1	Proportion of `celery_task_finished_total{status!="success"}` exceeds the threshold for several consecutive cycles.
Celery retry storm	P1	`celery_task_retry_total` keeps increasing and is concentrated on the same `task` or `exception_type`.
Business task no success for a long time	P0/P1	Critical task hasn't updated `business_task_last_success_timestamp_seconds` for a long time.
DB pool near exhaustion	P1	`dependency_db_pool_connections{state="checked_out"}` approaches the `state="size"` value, or the `state="overflow"` value stays above 0.
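Two of the alert conditions above reduce to simple arithmetic, sketched below. The function names, the default of 3 pull cycles, and the 0.9 usage ratio are hypothetical values chosen for illustration. The pull interval must match the actual Datakit configuration.

```python
def export_is_stale(now: float, last_success_ts: float,
                    pull_interval_seconds: float, cycles: int = 3) -> bool:
    """True when the last successful export is older than `cycles` pull cycles.

    `last_success_ts` is self_monitor_export_last_success_timestamp_seconds.
    """
    return (now - last_success_ts) > cycles * pull_interval_seconds


def db_pool_near_exhaustion(pool_states: dict, ratio: float = 0.9) -> bool:
    """True when checked_out approaches size, or overflow connections exist.

    `pool_states` maps the `state` tag of dependency_db_pool_connections
    to its current value.
    """
    size = pool_states.get("size", 0)
    checked_out = pool_states.get("checked_out", 0)
    overflow = pool_states.get("overflow", 0)
    return overflow > 0 or (size > 0 and checked_out >= ratio * size)
```

For example, with a 30-second pull interval, a last success 100 seconds ago exceeds 3 cycles and would be treated as stale.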

Common DQL Examples¶

View current backlog per queue:

M::`df_studio`:(max(`worker_queue_count`)) BY `queue`

View oldest task wait time per queue:

M::`df_studio`:(max(`celery_queue_oldest_wait_seconds`)) BY `queue`

Compare task publish counts with execution start counts:

M::`df_studio`:(sum(`celery_task_published_total`), sum(`celery_task_started_total`)) BY `queue`,`task`

View task failure exception TopN:

M::`df_studio`:(sum(`celery_task_failure_exception_total`)) BY `task`,`exception_type`

Check if beat missed scheduling:

M::`df_studio`:(max(`celery_beat_missed`), max(`celery_beat_lag_seconds`)) BY `beat_name`,`task`

View self-monitoring export status:

M::`df_studio`:(max(`self_monitor_export_total`), max(`self_monitor_export_points_total`), max(`self_monitor_export_duration_seconds`)) BY `exporter`,`result`

View business task last success time:

M::`df_studio`:(max(`business_task_last_success_timestamp_seconds`)) BY `domain`,`task`

Relationship with Existing Self-Monitoring Documentation¶

For the complete self-monitoring deployment process for the Deployment Plan, please refer to the document "Enabling Observability for the Deployment Plan Itself" in the same directory. That document covers general steps like DataKit deployment, Prometheus pull configuration, APM, RUM, Synthetic Tests, Monitors, and template import. This document only supplements the df_studio metrics output by the Studio backend itself, configuration switches, tag units, and monitoring criteria for asynchronous tasks/Redis/Broker.