Studio Self-Monitoring Configuration and Metrics Explanation

This document explains how to confirm whether self-monitoring is enabled for the Deployment Plan Studio. It also details the metrics, tags, units, and monitoring recommendations for APIs, Celery asynchronous tasks, Redis/Broker, business tasks, and the export pipeline under the self-monitoring Measurement df_studio.

Applicable Versions

  • The self-monitoring active metrics capability is available starting with the release dated May 20, 2026.
  • The release dated May 13, 2026 does not support this capability.
  • The Lark issue ticket confirms that the latest Deployment Plan v1.130.225 supports this capability. This version corresponds to Studio system commit 60a71d992, and the metrics and configurations in this document have been verified against that commit.
  • If the environment is running a version below v1.130.225, upgrade first and then configure.

Collection Pipeline

The Studio application side does not actively push metrics to external services. The recommended pipeline is:

Studio API / Celery / WebSocket / Snapshot
  -> Lightweight metric recording within the application
  -> Redis metric cache
  -> inner /metrics Prometheus text endpoint
  -> Datakit periodic pull
  -> Self-monitoring Workspace
  -> Dashboard / Monitor / Alert

Datakit pull address:

http://<inner-service-ip>:5000/api/v1/inner/metrics?from=datakit&type=df_studio

The Prometheus text endpoint outputs the full metric name, e.g., df_studio_celery_task_published_total. In the Guance UI or DQL, queries are typically performed by "Measurement + field", i.e., the Measurement is df_studio, and the field is celery_task_published_total.
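The naming relationship above can be sketched in a few lines of Python. This is only an illustration of the prefix rule described in this document; the helper name split_metric_name is hypothetical and is not part of Studio or Datakit.

```python
MEASUREMENT = "df_studio"


def split_metric_name(full_name: str, measurement: str = MEASUREMENT) -> tuple[str, str]:
    """Split a full Prometheus metric name into (measurement, field).

    Assumes the field is simply the full name with the "<measurement>_"
    prefix removed, as described in the text above.
    """
    prefix = measurement + "_"
    if not full_name.startswith(prefix):
        raise ValueError(f"{full_name!r} does not belong to measurement {measurement!r}")
    return measurement, full_name[len(prefix):]


print(split_metric_name("df_studio_celery_task_published_total"))
# -> ('df_studio', 'celery_task_published_total')
```

The field returned here is the name to use inside a DQL query on the df_studio Measurement.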

How to Check if Self-Monitoring Configuration is Enabled

1. Check Studio Backend Configuration

The Studio backend configuration item is SelfMonitorMetricsSet. It is disabled by default. To turn it on, users only need to set enable to true explicitly:

SelfMonitorMetricsSet:
  enable: true

Other configurations can remain at their defaults. Their meanings are as follows:

Configuration Item | Default Value | Unit | Description
enable | false | Boolean | Unified switch for self-monitoring. Only when set to true will metrics related to APIs, Celery, business tasks, and /metrics export be recorded.
expireSeconds | 3600 | seconds | Retention window for periodic incremental metrics in Redis.
stateExpireSeconds | 604800 | seconds | Retention window for stateful metrics like beat last publish time, business task last success/failure.
beatMissedLagThresholdSeconds | 300 | seconds | Default lag threshold for determining if a beat execution hasn't started after publishing.
beatMissedIntervalMultiplier | 2 | multiplier | Multiplier of the recent publish interval used to determine if a low-frequency beat missed scheduling.
celeryQueues | celery, correlation_task, snapshot_queue, compute_task | list | Celery queues for which queue length and oldest wait time need to be read.

It can also be overridden via environment variables:

STUDIO__SelfMonitorMetricsSet__enable=true

Note: enable must be a boolean value, either true or false. An invalid string or null causes configuration loading to fail.
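Because an invalid value makes configuration loading fail, it can help to validate the override before rolling it out. The following is a conservative pre-deployment check, not the Studio loader itself: the function parse_enable is hypothetical, and accepting only the case-insensitive strings true and false is an assumption (the real loader may accept other spellings).

```python
import os

ENV_NAME = "STUDIO__SelfMonitorMetricsSet__enable"


def parse_enable(raw):
    """Return True or False for a boolean-like value, else raise ValueError.

    Assumption: only "true" and "false" (case-insensitive) are treated as valid.
    """
    if raw is None:
        raise ValueError("enable is null; expected true or false")
    text = str(raw).strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"enable={raw!r} is not boolean-like; expected true or false")


if __name__ == "__main__":
    # The documented default is false when the variable is not set.
    raw = os.environ.get(ENV_NAME, "false")
    print(ENV_NAME, "->", parse_enable(raw))
```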

2. Check if /metrics Outputs Self-Monitoring Metrics

Access the inner service within the cluster:

curl 'http://management-backend.forethought-core:5000/api/v1/inner/metrics?from=datakit&type=df_studio'

If enabled and exporting normally, the response should contain content similar to:

df_studio_self_monitor_export_total{exporter="prometheus_inner",result="success"} 1
df_studio_self_monitor_export_duration_seconds{exporter="prometheus_inner",result="success"} ...
df_studio_self_monitor_export_last_success_timestamp_seconds{exporter="prometheus_inner"} ...

If an exception occurs during export, the interface fails open: it still attempts to return the failure metrics:

df_studio_self_monitor_export_total{exporter="prometheus_inner",result="failure"} 1
df_studio_self_monitor_export_error_total{exception_type="...",exporter="prometheus_inner"} 1
df_studio_self_monitor_export_last_failure_timestamp_seconds{exception_type="...",exporter="prometheus_inner"} ...
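The check above can be automated by scanning the response body for the export counters. The sketch below parses Prometheus text lines with a simplified regular expression (it assumes label values contain no closing brace or escaped quotes) and reports whether the export looks successful, failed, or absent. The function names and the sample values are hypothetical.

```python
import re

# Simplified sample-line pattern: name, optional {labels}, then a value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line; skip comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def export_state(text):
    """Return "failure", "success", or "absent" based on the export counter."""
    results = {
        labels.get("result")
        for name, labels, value in parse_samples(text)
        if name == "df_studio_self_monitor_export_total" and value > 0
    }
    if "failure" in results:
        return "failure"
    if "success" in results:
        return "success"
    return "absent"


# Hypothetical response body for illustration only.
body = 'df_studio_self_monitor_export_total{exporter="prometheus_inner",result="success"} 1\n'
print(export_state(body))
```

A result of "absent" suggests that the switch is not enabled or that the endpoint returned no self-monitoring samples at all.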

3. Check the Legacy Health Check Interface

The management backend still retains the Celery worker health check interface:

curl 'http://management-backend.forethought-core:5000/api/v1/const/celery/ping'

This interface reads celery_active_point from Redis and returns the last active time for each queue. A 200 response indicates that an active point exists within the configured valid offset time. A 400 response usually indicates that the corresponding worker has not updated its active point for a long time, which may be caused by the worker not running, a task backlog, Redis/Broker connection issues, and so on.

This interface remains suitable as a backward-compatible health check. For complete self-monitoring, prefer the df_studio metrics described below.

Metrics and Tag Conventions

Global Tags

Tag | Applicable Scope | Meaning | Common Values | Usage Suggestions
service | API | Service entry name | front, inner, openapi, admin, external, center, aiapi, sse | Low cardinality, suitable for overviews.
run_app_code | API | Current process run entry | Same as service | Low cardinality, useful for distinguishing entries.
route_rule | API | Flask route rule | /api/v1/... | More suitable for aggregation than raw URLs.
method | API | HTTP method | GET, POST, PUT | Low cardinality.
status_class | API | HTTP status code family | 2xx, 4xx, 5xx | Used for success rate, error rate.
queue | Celery | Celery queue name | celery, correlation_task, snapshot_queue, compute_task | Low cardinality, core dimension for asynchronous task overview.
task | Celery / Business tasks | Celery task name or business task name | forethought.tasks..., statistics_upload | Medium cardinality, used for task-level troubleshooting.
status | Celery | Task end status | success, failure, retry | Used for task quality analysis.
exception_type | Celery / Export pipeline | Exception type | TimeoutError, OperationalError | Used for exception TopN.
beat_name | Celery beat | Beat entry name | Beat entry name from configuration | Used to determine if scheduled tasks missed scheduling.
domain | Business tasks | Business domain | archive_report, incidents, billing, cleanup | Low cardinality, primary dimension for business task overview.
result | Business tasks / Export pipeline | Execution result | success, error, failure, partial_success, skipped | Used for success and failure rates.
item_type | Business tasks | Processed object type | workspace, report_task, notification | Low cardinality.
reason | Business tasks | Partial failure reason | notify_failed, item_error | Can be used for alerts after controlled enumeration.
entry | Independent entry | Non-Flask entry | websocket, snapshot | Used for independent entry health.
event | Independent entry | Entry event | connect, disconnect, send_task | Used for entry event analysis.
state | Current state metrics | State name | size, checked_out, overflow | Specific meaning depends on the metric.
exporter | /metrics export | Exporter name | prometheus_inner | Low cardinality.
le | Histogram bucket | Bucket upper bound | 0.1, 1, 5, +Inf | Used only for _bucket metrics to calculate percentiles.

le represents the less-than-or-equal-to upper bound of a histogram bucket, not a business dimension. For example, le="1" indicates the cumulative count of samples less than or equal to 1 second, le="+Inf" indicates the total number of all samples.
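To show how cumulative le buckets turn into a percentile, here is a minimal sketch that uses linear interpolation inside the bucket that contains the target rank, the same general idea commonly applied to Prometheus-style histograms. The bucket counts are made up for illustration, and whether the Guance UI computes percentiles exactly this way is not confirmed by this document.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket (math.inf), whose count is the total.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for le = 0.1, 1, 5, +Inf.
example = [(0.1, 50), (1.0, 90), (5.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, example))  # falls in the (1, 5] bucket
```

With these counts, the 95th percentile lands between 1 and 5 seconds because 90 samples are at or below 1 second and 99 are at or below 5 seconds.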

API Metrics

Metric Field | Unit | Tags | Meaning
api_request_count | count | service, api_path | Legacy-compatible count of non-5xx API requests.
api_request_error_count | count | service, api_path | Legacy-compatible count of 5xx API requests.
api_requests_total | count | service, run_app_code, route_rule, method, status_class | Total API request count, periodic increment.
api_errors_total | count | service, run_app_code, route_rule, method, status_class, error_type | API error count, currently mainly covering HTTP 5xx.
api_duration_seconds_bucket | seconds | service, run_app_code, route_rule, method, status_class, le | API request duration distribution.
api_duration_seconds_sum | seconds | service, run_app_code, route_rule, method, status_class | Sum of API request durations.
api_duration_seconds_count | count | service, run_app_code, route_rule, method, status_class | Number of API request duration samples.
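Two derived values are commonly built from these fields: the 5xx error rate (from api_requests_total grouped by status_class) and the mean request duration (api_duration_seconds_sum divided by api_duration_seconds_count). The sketch below shows the arithmetic only; the function names and input numbers are hypothetical, and the inputs are assumed to be increments from the same time window.

```python
def error_rate(requests_by_status_class):
    """Share of 5xx requests among all requests in one window."""
    total = sum(requests_by_status_class.values())
    if total == 0:
        return 0.0
    return requests_by_status_class.get("5xx", 0) / total


def mean_duration(duration_sum, duration_count):
    """Average request duration in seconds; 0.0 when there are no samples."""
    if duration_count == 0:
        return 0.0
    return duration_sum / duration_count


# Hypothetical per-window increments for one route_rule.
window = {"2xx": 95, "4xx": 3, "5xx": 2}
print(error_rate(window))          # 2 of 100 requests were 5xx
print(mean_duration(12.5, 50))     # 12.5 s spread over 50 requests
```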

Celery Queue and Task Metrics

The following metrics are recorded via Celery signals as of commit 60a71d992 and are exported under the df_studio Measurement. worker_queue_count and celery_queue_oldest_wait_seconds read the Redis broker queue directly and are used to detect Redis/Broker queue backlog or workers that are not consuming. The Celery task lifecycle metrics further distinguish between "not started consuming" and "stuck after starting".

Metric Field | Unit | Tags | Meaning
worker_queue_count | count | queue | Current length of the Redis broker queue.
celery_queue_oldest_wait_seconds | seconds | queue | Time the oldest task in the queue has waited since it was published.
celery_task_published_total | count | task, queue | Number of Celery task publications.
celery_task_started_total | count | task, queue | Number of Celery task execution starts.
celery_task_finished_total | count | task, queue, status | Number of Celery task completions, distinguished by status.
celery_task_active | count | task, queue | Number of Celery tasks currently executing.
celery_task_duration_seconds_bucket | seconds | task, queue, le | Task execution duration distribution.
celery_task_duration_seconds_sum | seconds | task, queue | Sum of task execution durations.
celery_task_duration_seconds_count | count | task, queue | Number of task execution duration samples.
celery_task_queue_wait_seconds_bucket | seconds | task, queue, le | Distribution of queue wait time from task publish to execution start.
celery_task_queue_wait_seconds_sum | seconds | task, queue | Sum of task queue wait times.
celery_task_queue_wait_seconds_count | count | task, queue | Number of task queue wait time samples.
celery_task_failure_exception_total | count | task, queue, exception_type | Distribution of task failure exception types.
celery_task_timeout_total | count | task, queue, timeout_type | Number of Celery soft/hard timeout occurrences.
celery_task_retry_total | count | task, queue, exception_type | Number of task retries.
celery_task_retry_delay_seconds_bucket | seconds | task, queue, le | Task retry delay distribution.
celery_task_retry_delay_seconds_sum | seconds | task, queue | Sum of task retry delays.
celery_task_retry_delay_seconds_count | count | task, queue | Number of task retry delay samples.

Beat and Scheduled Task Metrics

Metric Field | Unit | Tags | Meaning
celery_beat_task_last_publish_timestamp_seconds | Unix seconds | beat_name, task | Last publish time of the beat entry task.
celery_beat_task_last_started_timestamp_seconds | Unix seconds | beat_name, task | Last execution start time of the beat entry's corresponding task.
celery_beat_lag_seconds | seconds | beat_name, task | Lag from beat task publish to worker execution start.
celery_beat_publish_interval_seconds | seconds | beat_name, task | Actual interval between the last two publishes of the beat entry.
celery_beat_missed | boolean | beat_name, task | Whether scheduling is suspected to be missed, 1 indicates suspected missed scheduling.
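This document does not spell out how celery_beat_missed is computed. The sketch below is one plausible reading of how the two configuration items beatMissedLagThresholdSeconds (default 300) and beatMissedIntervalMultiplier (default 2) could combine with the timestamps above. It is an assumption for illustration, not the Studio implementation, and the function name beat_suspected_missed is hypothetical.

```python
def beat_suspected_missed(now, last_publish, last_started, publish_interval,
                          lag_threshold=300, interval_multiplier=2):
    """Return True when a beat entry looks like it missed scheduling.

    Assumed rule 1: the task was published but has not started within
    lag_threshold seconds.
    Assumed rule 2: no new publish has arrived within
    interval_multiplier * publish_interval seconds.
    All timestamps are Unix seconds; last_started may be None.
    """
    started_after_publish = last_started is not None and last_started >= last_publish
    if not started_after_publish and now - last_publish > lag_threshold:
        return True
    if publish_interval and now - last_publish > interval_multiplier * publish_interval:
        return True
    return False


# Published 400 s ago and still not started: suspected missed under rule 1.
print(beat_suspected_missed(now=1000, last_publish=600, last_started=500,
                            publish_interval=3600))
```

In practice the exported celery_beat_missed field should be used directly; a sketch like this is mainly useful for reasoning about why the field flipped to 1.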

Business Task Metrics

Metric Field | Unit | Tags | Meaning
business_task_runs_total | count | domain, task, result | Number of business task runs.
business_task_items_total | count | domain, task, item_type, result | Number of objects processed by business tasks.
business_task_duration_seconds_bucket | seconds | domain, task, result, le | End-to-end duration distribution of business tasks.
business_task_duration_seconds_sum | seconds | domain, task, result | Sum of end-to-end durations of business tasks.
business_task_duration_seconds_count | count | domain, task, result | Number of end-to-end duration samples for business tasks.
business_task_last_success_timestamp_seconds | Unix seconds | domain, task | Last successful time of the business task.
business_task_last_failure_timestamp_seconds | Unix seconds | domain, task, exception_type | Last failure time of the business task.
business_task_partial_failure_total | count | domain, task, reason | Number of times a task did not fail overall but had partial failures.

Currently integrated business domains include:

domain | Typical Tasks | Focus Points
archive_report | Archive report v2/v3, first-cycle notification, delayed notification | Whether report triggering, screenshots, and notifications succeed, and whether partial failures occur.
incidents | Incident duty policy analysis, incident queue sync, incident notification sending | Whether the incident notification pipeline succeeds and whether there is a backlog.
billing | Billing statistics reporting | Whether it runs on time and succeeds, and the number of workspaces processed.
workspace_usage | OpenAPI API Key usage database refresh | Whether the usage refresh succeeds, and the number of buckets and access keys processed.
cleanup | Dashboard history cleanup, etc. | Whether cleanup tasks are failing long-term or being skipped.
sync_config | Integration template synchronization | Whether configuration synchronization succeeds.
notification | Status Page status change notification | Whether notification tasks succeed or fail.
keyevent | Critical event unresolved asynchronous query | Whether critical event asynchronous queries are abnormal.
cloud_collector | Cloud collector asynchronous operations | Asynchronous operation splitting, lock waiting, success/failure.
catalog | Unified catalog entity health | Whether entity health tasks run on time and succeed, and whether the processing volume is abnormal.
snapshot | Dashboard screenshot, chart screenshot, chart data generation | Results of snapshot service screenshot and chart data tasks.

Independent Entry and Dependency Health Metrics

Metric Field | Unit | Tags | Meaning
service_entry_events_total | count | entry, event, result | Number of events for non-Flask entries like WebSocket, snapshot.
service_entry_active | count/boolean | entry, state | Current active state of non-Flask entries.
dependency_db_pool_connections | count | pool, state | Current state of the database connection pool in the exporter's process. state includes size, checked_in, checked_out, overflow.
self_monitor_export_total | count | exporter, result | Result of this /metrics export.
self_monitor_export_points_total | count | exporter, result | Number of Prometheus samples successfully exported in this /metrics export.
self_monitor_export_duration_seconds | seconds | exporter, result | Duration of this /metrics export.
self_monitor_export_last_success_timestamp_seconds | Unix seconds | exporter | Last successful export time.
self_monitor_export_last_failure_timestamp_seconds | Unix seconds | exporter, exception_type | Time of the last fail-open export failure.
self_monitor_export_error_total | count | exporter, exception_type | Fail-open failure event for this export.
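The state values of dependency_db_pool_connections can be combined into a simple pressure indicator. The sketch below is illustrative: the function name pool_pressure and the 0.9 ratio are hypothetical choices, and the input is assumed to be the latest value per state for a single pool.

```python
def pool_pressure(states, near_ratio=0.9):
    """Summarize connection pool pressure from per-state values.

    states maps state name (size, checked_in, checked_out, overflow) to its
    latest value. near_ratio is an assumed threshold, not a documented one.
    """
    size = states.get("size", 0)
    checked_out = states.get("checked_out", 0)
    overflow = states.get("overflow", 0)
    utilization = checked_out / size if size else 0.0
    return {
        "utilization": utilization,
        "near_exhaustion": utilization >= near_ratio or overflow > 0,
    }


# Hypothetical snapshot: 19 of 20 connections are checked out.
print(pool_pressure({"size": 20, "checked_in": 1, "checked_out": 19, "overflow": 0}))
```

A single high reading is not necessarily a problem; the signal is more meaningful when it persists across several pull cycles.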

Asynchronous Task and Redis/Broker Monitoring Recommendations

Customer questions such as "are asynchronous tasks abnormal, is Redis disconnected, are workers stuck" cannot be answered from a single metric. Use combined conditions instead.

Scenario | Priority Observation Metrics | Recommended Dimensions | Interpretation Method
Worker not consuming or insufficient consumption capacity | worker_queue_count, celery_queue_oldest_wait_seconds, celery_task_published_total, celery_task_started_total | queue, task | Queue length and oldest wait time keep rising, and published increases while started stays very low. This usually indicates that the worker is not consuming, has insufficient capacity, or has connection issues with the broker.
Redis/Broker readable but worker disconnected | worker_queue_count, celery_queue_oldest_wait_seconds, celery_task_active | queue | The exporter can read the queue and the backlog is rising, but active stays at 0 or significantly low for a long time. Suspect that the worker is disconnected, hung, or not started.
Redis/Broker completely unavailable or exporter read failure | self_monitor_export_total, self_monitor_export_error_total, self_monitor_export_last_failure_timestamp_seconds, self_monitor_export_points_total | exporter, exception_type | If /metrics fails open, the failure time refreshes, and the sample count drops significantly, the collection pipeline itself may have failed to access Redis, the DB, or a metric source.
Task starts but gets stuck and doesn't finish | celery_task_active, celery_task_started_total, celery_task_finished_total, celery_task_duration_seconds_bucket | queue, task | Active does not decrease for a long time, started increases but finished does not, or P99 duration keeps rising. Tasks may be stuck on external calls, locks, the DB, or loop logic.
Task failure or retry storm | celery_task_finished_total, celery_task_failure_exception_total, celery_task_retry_total, celery_task_retry_delay_seconds_bucket | task, exception_type | Failures and retries both rise and the exception types are concentrated. Tasks may have entered a failure-retry loop.
Beat publishes normally but worker doesn't start | celery_beat_task_last_publish_timestamp_seconds, celery_beat_task_last_started_timestamp_seconds, celery_beat_lag_seconds, celery_beat_missed | beat_name, task | last_publish updates but last_started does not, and lag rises or missed=1. The scheduled task was delivered but the worker has not started consuming it.
Beat stops publishing or low-frequency task misses scheduling | celery_beat_publish_interval_seconds, celery_beat_task_last_publish_timestamp_seconds, celery_beat_missed | beat_name, task | The publish interval exceeds the historical period or last_publish is too old. Beat may have stopped, the configuration may not be enabled, or the scheduler may be abnormal.
Business task overall success but partial object failure | business_task_partial_failure_total, business_task_items_total, business_task_runs_total | domain, task, reason, item_type | Partial failures increase while the overall task result may still be partial_success. Look at the failure reasons of the specific business objects.
Business task no success for a long time | business_task_last_success_timestamp_seconds, business_task_last_failure_timestamp_seconds, business_task_runs_total | domain, task | last_success is too far from the current time, and last_failure keeps updating or runs contain no success. This business pipeline may be failing silently.
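As an illustration of combining conditions, the sketch below classifies one queue from the change of several metrics over an observation window. It covers only two of the scenarios in the table, and the zero thresholds are a deliberate simplification; real monitors would use tuned thresholds and longer windows. The names QueueWindow and diagnose are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueWindow:
    """Changes observed for one queue over a window (hypothetical structure)."""
    queue_length_delta: float   # change in worker_queue_count
    oldest_wait_delta: float    # change in celery_queue_oldest_wait_seconds
    published_delta: float      # growth of celery_task_published_total
    started_delta: float        # growth of celery_task_started_total
    finished_delta: float       # growth of celery_task_finished_total
    active: float               # latest celery_task_active


def diagnose(w: QueueWindow) -> str:
    """Return a coarse label: worker_not_consuming, worker_stuck, or ok."""
    backlog_rising = w.queue_length_delta > 0 and w.oldest_wait_delta > 0
    if backlog_rising and w.published_delta > 0 and w.started_delta == 0:
        return "worker_not_consuming"
    if w.active > 0 and w.finished_delta == 0:
        return "worker_stuck"
    return "ok"


print(diagnose(QueueWindow(5, 30, 10, 0, 0, 0)))
# -> worker_not_consuming
```

The ordering matters: a rising backlog with no starts is checked first, because a queue with nothing started cannot meaningfully be called stuck.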

It is recommended to at least set up the following alerts:

Alert Item | Suggested Severity | Suggested Condition
Self-monitoring export failure | P0 | self_monitor_export_total{result="failure"} appears or self_monitor_export_error_total appears.
Self-monitoring no success for a long time | P0 | Current time minus self_monitor_export_last_success_timestamp_seconds exceeds 2 to 3 Datakit pull cycles.
Celery queue backlog | P0 | worker_queue_count continuously exceeds threshold, or celery_queue_oldest_wait_seconds continuously exceeds business-acceptable wait time.
Worker suspected not consuming | P0 | celery_task_published_total increases, but celery_task_started_total shows no growth for a long time, while queue length or oldest wait time rises.
Worker suspected stuck | P0 | celery_task_active > 0 for a long time and doesn't decrease, celery_task_finished_total doesn't grow, task duration P99 continuously rises.
Beat missed scheduling | P0 | celery_beat_missed=1, or celery_beat_lag_seconds exceeds task acceptable threshold.
Celery task failure rate increase | P1 | Proportion of celery_task_finished_total{status!="success"} exceeds threshold for consecutive multiple cycles.
Celery retry storm | P1 | celery_task_retry_total increases consecutively, concentrated on the same task or exception_type.
Business task no success for a long time | P0/P1 | Critical task hasn't updated business_task_last_success_timestamp_seconds for a long time.
DB pool near exhaustion | P1 | dependency_db_pool_connections{state="checked_out"} approaches state="size", or state="overflow" > 0 appears continuously.
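Two of the conditions above reduce to simple arithmetic: staleness of the last successful export, and the non-success share of finished Celery tasks. The following sketch shows both. The function names are hypothetical, the pull interval must come from the actual Datakit configuration, and the default of 3 cycles is taken from the upper end of the "2 to 3 pull cycles" suggestion.

```python
import time


def export_is_stale(last_success_ts, pull_interval_seconds, now=None, missed_cycles=3):
    """True when the last successful export is older than missed_cycles pulls."""
    if now is None:
        now = time.time()
    return now - last_success_ts > missed_cycles * pull_interval_seconds


def non_success_ratio(finished_by_status):
    """Share of finished tasks whose status is not "success" in one window."""
    total = sum(finished_by_status.values())
    if total == 0:
        return 0.0
    return (total - finished_by_status.get("success", 0)) / total


# Hypothetical values: 60 s pull interval, last success 200 s ago.
print(export_is_stale(last_success_ts=1000, pull_interval_seconds=60, now=1200))
print(non_success_ratio({"success": 90, "failure": 8, "retry": 2}))
```

For the failure-rate alert, the ratio should exceed the chosen threshold for several consecutive windows before firing, as the table suggests.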

Common DQL Examples

View current backlog per queue:

M::`df_studio`:(max(`worker_queue_count`)) BY `queue`

View oldest task wait time per queue:

M::`df_studio`:(max(`celery_queue_oldest_wait_seconds`)) BY `queue`

View difference between task publish and start execution:

M::`df_studio`:(sum(`celery_task_published_total`), sum(`celery_task_started_total`)) BY `queue`,`task`

View task failure exception TopN:

M::`df_studio`:(sum(`celery_task_failure_exception_total`)) BY `task`,`exception_type`

Check if beat missed scheduling:

M::`df_studio`:(max(`celery_beat_missed`), max(`celery_beat_lag_seconds`)) BY `beat_name`,`task`

View self-monitoring export status:

M::`df_studio`:(max(`self_monitor_export_total`), max(`self_monitor_export_points_total`), max(`self_monitor_export_duration_seconds`)) BY `exporter`,`result`

View business task last success time:

M::`df_studio`:(max(`business_task_last_success_timestamp_seconds`)) BY `domain`,`task`

Relationship with Existing Self-Monitoring Documentation

For the complete self-monitoring deployment process for the Deployment Plan, please refer to the document "Enabling Observability for the Deployment Plan Itself" in the same directory. That document covers general steps like DataKit deployment, Prometheus pull configuration, APM, RUM, Synthetic Tests, Monitors, and template import. This document only supplements the df_studio metrics output by the Studio backend itself, configuration switches, tag units, and monitoring criteria for asynchronous tasks/Redis/Broker.
