Studio Self-Monitoring Configuration and Metrics Explanation
This document explains how to confirm whether self-monitoring configuration is enabled for the Deployment Plan Studio, and details the metrics, tags, units, and monitoring recommendations related to APIs, Celery asynchronous tasks, Redis/Broker, business tasks, and the export pipeline within the self-monitoring Measurement df_studio.
Applicable Versions
- The self-monitoring active metrics capability is available starting with the release of May 20, 2026.
- The release of May 13, 2026 does not yet support this capability.
- The Lark issue ticket confirms that the latest Deployment Plan version `v1.130.225` supports this capability. This version corresponds to the current Studio system commit `60a71d992`. The metrics and configurations in this document have been verified against this commit.
- If the environment is below `v1.130.225`, upgrade first and then configure.
Collection Pipeline
The Studio application side does not actively push metrics to external services. The recommended pipeline is:
Studio API / Celery / WebSocket / Snapshot
-> Lightweight metric recording within the application
-> Redis metric cache
-> inner /metrics Prometheus text endpoint
-> Datakit periodic pull
-> Self-monitoring Workspace
-> Dashboard / Monitor / Alert
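Datakit pull address (the same in-cluster inner endpoint used in the `/metrics` check later in this document): `http://management-backend.forethought-core:5000/api/v1/inner/metrics?from=datakit&type=df_studio`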
The Prometheus text endpoint outputs the full metric name, e.g., df_studio_celery_task_published_total. In the Guance UI or DQL, queries are typically performed by "Measurement + field", i.e., the Measurement is df_studio, and the field is celery_task_published_total.
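The naming relationship above can be illustrated with a small helper. This is a minimal sketch, assuming the Measurement name is always joined to the field with a single underscore; `split_metric_name` is a hypothetical helper, not part of Studio or Datakit.

```python
MEASUREMENT = "df_studio"


def split_metric_name(full_name, measurement=MEASUREMENT):
    """Split a full Prometheus metric name into (measurement, field).

    Assumption: the exported name is "<measurement>_<field>".
    """
    prefix = measurement + "_"
    if not full_name.startswith(prefix) or len(full_name) == len(prefix):
        raise ValueError(
            f"{full_name!r} does not belong to measurement {measurement!r}"
        )
    return measurement, full_name[len(prefix):]


print(split_metric_name("df_studio_celery_task_published_total"))
# → ('df_studio', 'celery_task_published_total')
```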
How to Check if Self-Monitoring Configuration is Enabled
1. Check Studio Backend Configuration
The Studio backend configuration item is `SelfMonitorMetricsSet`. Self-monitoring is disabled by default; users only need to explicitly set `enable` to true.
Other configurations can remain at their defaults. Their meanings are as follows:
| Configuration Item | Default Value | Unit | Description |
|---|---|---|---|
| `enable` | `false` | Boolean | Unified switch for self-monitoring. Metrics related to APIs, Celery, business tasks, and /metrics export are recorded only when it is set to true. |
| `expireSeconds` | `3600` | seconds | Retention window for periodic incremental metrics in Redis. |
| `stateExpireSeconds` | `604800` | seconds | Retention window for stateful metrics such as beat last publish time and business task last success/failure. |
| `beatMissedLagThresholdSeconds` | `300` | seconds | Default lag threshold for determining that a beat task has not started executing after being published. |
| `beatMissedIntervalMultiplier` | `2` | multiplier | Multiplier of the recent publish interval used to determine whether a low-frequency beat missed scheduling. |
| `celeryQueues` | `celery, correlation_task, snapshot_queue, compute_task` | list | Celery queues for which queue length and oldest wait time are read. |
It can also be overridden via environment variables:
Note: `enable` must have boolean semantics (true or false). Invalid strings or null cause configuration loading to fail.
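The strictness described in the note can be pictured with the following sketch. It is not the Studio configuration loader; `parse_enable` is a hypothetical function, and the exact set of accepted spellings (here only real booleans and the case-insensitive strings "true" and "false") is an assumption.

```python
def parse_enable(value):
    """Return a bool for a boolean-like value, or raise ValueError.

    Assumption: only real booleans and the strings "true"/"false"
    (case-insensitive) count as boolean semantics.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text == "true":
            return True
        if text == "false":
            return False
    # None, empty strings, "yes", numbers, etc. are rejected outright,
    # mirroring "invalid strings or null cause loading to fail".
    raise ValueError(f"enable must be true or false, got {value!r}")
```

Failing loudly at load time, rather than silently falling back to `false`, makes a mistyped switch visible immediately instead of producing an environment that quietly records no metrics.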
2. Check if /metrics Outputs Self-Monitoring Metrics
Access the inner service within the cluster:
curl 'http://management-backend.forethought-core:5000/api/v1/inner/metrics?from=datakit&type=df_studio'
If enabled and exporting normally, the response should contain content similar to:
df_studio_self_monitor_export_total{exporter="prometheus_inner",result="success"} 1
df_studio_self_monitor_export_duration_seconds{exporter="prometheus_inner",result="success"} ...
df_studio_self_monitor_export_last_success_timestamp_seconds{exporter="prometheus_inner"} ...
If an exception occurs during export, the interface will fail-open, still attempting to return failure metrics:
df_studio_self_monitor_export_total{exporter="prometheus_inner",result="failure"} 1
df_studio_self_monitor_export_error_total{exception_type="...",exporter="prometheus_inner"} 1
df_studio_self_monitor_export_last_failure_timestamp_seconds{exception_type="...",exporter="prometheus_inner"} ...
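The check above can also be scripted. The following is a minimal sketch that parses Prometheus text output and reports whether the last export succeeded; `parse_samples` and `export_status` are hypothetical helper names, and the parser only handles the simple `name{labels} value` lines shown above, not the full exposition format.

```python
import re

# Matches: metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse Prometheus text lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue  # skip lines whose value is not numeric
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, value))
    return samples


def export_status(text):
    """Return "failure", "success", or "missing" for the export metrics."""
    results = {
        labels.get("result")
        for name, labels, value in parse_samples(text)
        if name == "df_studio_self_monitor_export_total" and value > 0
    }
    if "failure" in results:
        return "failure"
    if "success" in results:
        return "success"
    return "missing"  # self-monitoring is likely disabled or not exported
```

A "missing" result is worth treating separately from "failure": it usually points at the `enable` switch or the version, not at a broken dependency.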
3. Check Historical Health Check Interface
The management backend still retains the Celery worker health check interface:
This interface reads celery_active_point from Redis and returns the last active time for each queue. A 200 response indicates there is an active point within the configured valid offset time. A 400 response usually indicates that the corresponding worker hasn't updated its active point for a long time, which may be due to the worker not running, task backlog, Redis/Broker connection issues, etc.
This interface is suitable as a compatible health check. For complete self-monitoring, it is recommended to prioritize using the df_studio metrics described below.
Metrics and Tag Conventions
Global Tags
| Tag | Applicable Scope | Meaning | Common Values | Usage Suggestions |
|---|---|---|---|---|
| `service` | API | Service entry name | `front`, `inner`, `openapi`, `admin`, `external`, `center`, `aiapi`, `sse` | Low cardinality, suitable for overviews. |
| `run_app_code` | API | Current process run entry | Same as `service` | Low cardinality, useful for distinguishing entries. |
| `route_rule` | API | Flask route rule | `/api/v1/...` | More suitable for aggregation than raw URLs. |
| `method` | API | HTTP method | `GET`, `POST`, `PUT` | Low cardinality. |
| `status_class` | API | HTTP status code family | `2xx`, `4xx`, `5xx` | Used for success rate and error rate. |
| `queue` | Celery | Celery queue name | `celery`, `correlation_task`, `snapshot_queue`, `compute_task` | Low cardinality, core dimension for asynchronous task overview. |
| `task` | Celery / Business tasks | Celery task name or business task name | `forethought.tasks...`, `statistics_upload` | Medium cardinality, used for task-level troubleshooting. |
| `status` | Celery | Task end status | `success`, `failure`, `retry` | Used for task quality analysis. |
| `exception_type` | Celery / Export pipeline | Exception type | `TimeoutError`, `OperationalError` | Used for exception TopN. |
| `beat_name` | Celery beat | Beat entry name | Beat entry name from configuration | Used to determine if scheduled tasks missed scheduling. |
| `domain` | Business tasks | Business domain | `archive_report`, `incidents`, `billing`, `cleanup` | Low cardinality, primary dimension for business task overview. |
| `result` | Business tasks / Export pipeline | Execution result | `success`, `error`, `failure`, `partial_success`, `skipped` | Used for success and failure rates. |
| `item_type` | Business tasks | Processed object type | `workspace`, `report_task`, `notification` | Low cardinality. |
| `reason` | Business tasks | Partial failure reason | `notify_failed`, `item_error` | Can be used for alerts after controlled enumeration. |
| `entry` | Independent entry | Non-Flask entry | `websocket`, `snapshot` | Used for independent entry health. |
| `event` | Independent entry | Entry event | `connect`, `disconnect`, `send_task` | Used for entry event analysis. |
| `state` | Current state metrics | State name | `size`, `checked_out`, `overflow` | Specific meaning depends on the metric. |
| `exporter` | /metrics export | Exporter name | `prometheus_inner` | Low cardinality. |
| `le` | Histogram bucket | Bucket upper bound | `0.1`, `1`, `5`, `+Inf` | Used only for `_bucket` metrics to calculate percentiles. |
le represents the less-than-or-equal-to upper bound of a histogram bucket, not a business dimension. For example, le="1" indicates the cumulative count of samples less than or equal to 1 second, le="+Inf" indicates the total number of all samples.
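To make the cumulative `le` semantics concrete, the sketch below estimates a percentile from bucket counts using linear interpolation inside the matching bucket, which is the same general approach as Prometheus `histogram_quantile`. The function name and the sample counts are illustrative assumptions; in practice, counts from several tag series should be summed per `le` value before calling it.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: iterable of (le, cumulative_count); use math.inf for "+Inf".
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][0] != math.inf:
        raise ValueError("a +Inf bucket is required")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no samples observed
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in ordered:
        if count >= rank:
            if le == math.inf:
                # The quantile falls in the open-ended bucket: the best
                # available estimate is the highest finite upper bound.
                return prev_le
            if count == prev_count:
                return le
            # Linear interpolation within (prev_le, le].
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le


# Illustrative counts: 50 samples <= 0.1s, 90 <= 1s, 99 <= 5s, 100 total.
example = [(0.1, 50), (1.0, 90), (5.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)  # falls in the (1s, 5s] bucket
```

The result is only as precise as the bucket boundaries: a P95 inside the 1 to 5 second bucket is an interpolated estimate, not an observed value.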
API Metrics
| Metric Field | Unit | Tags | Meaning |
|---|---|---|---|
| `api_request_count` | count | `service`, `api_path` | Compatible with the old API non-5xx request count. |
| `api_request_error_count` | count | `service`, `api_path` | Compatible with the old API 5xx request count. |
| `api_requests_total` | count | `service`, `run_app_code`, `route_rule`, `method`, `status_class` | Total API request count, periodic increment. |
| `api_errors_total` | count | `service`, `run_app_code`, `route_rule`, `method`, `status_class`, `error_type` | API error count, currently mainly covering HTTP 5xx. |
| `api_duration_seconds_bucket` | seconds | `service`, `run_app_code`, `route_rule`, `method`, `status_class`, `le` | API request duration distribution. |
| `api_duration_seconds_sum` | seconds | `service`, `run_app_code`, `route_rule`, `method`, `status_class` | Sum of API request durations. |
| `api_duration_seconds_count` | count | `service`, `run_app_code`, `route_rule`, `method`, `status_class` | Number of API request duration samples. |
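Two derived values are commonly built from these fields: an error rate from `api_requests_total` grouped by `status_class`, and a mean duration from the `_sum` and `_count` pair. The sketch below shows the arithmetic only; the function names and input shapes are assumptions, and the inputs are expected to be increments over the same time window.

```python
def error_rate(requests_by_status_class):
    """Share of 5xx requests, given {status_class: request_count}."""
    total = sum(requests_by_status_class.values())
    if total == 0:
        return 0.0  # no traffic in the window
    return requests_by_status_class.get("5xx", 0) / total


def mean_duration(duration_sum, duration_count):
    """Mean request duration in seconds, or None when there are no samples."""
    if not duration_count:
        return None
    return duration_sum / duration_count


print(error_rate({"2xx": 95, "4xx": 3, "5xx": 2}))  # → 0.02
print(mean_duration(12.5, 50))  # → 0.25
```

Only 5xx is counted as an error here because, per the table, `api_errors_total` currently covers mainly HTTP 5xx; 4xx responses stay in the denominator.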
Celery Queue and Task Metrics
The following metrics are written via Celery signals as of commit `60a71d992` and are exported under the `df_studio` Measurement. `worker_queue_count` and `celery_queue_oldest_wait_seconds` read the Redis broker queue directly and are used to detect Redis/Broker queue backlog or workers that are not consuming. Celery task lifecycle metrics are used to further distinguish between "not started consuming" and "stuck after starting".
| Metric Field | Unit | Tags | Meaning |
|---|---|---|---|
| `worker_queue_count` | count | `queue` | Current length of the Redis broker queue. |
| `celery_queue_oldest_wait_seconds` | seconds | `queue` | Time the oldest task in the queue has waited since it was published. |
| `celery_task_published_total` | count | `task`, `queue` | Number of Celery task publications. |
| `celery_task_started_total` | count | `task`, `queue` | Number of Celery task execution starts. |
| `celery_task_finished_total` | count | `task`, `queue`, `status` | Number of Celery task completions, distinguished by status. |
| `celery_task_active` | count | `task`, `queue` | Number of Celery tasks currently executing. |
| `celery_task_duration_seconds_bucket` | seconds | `task`, `queue`, `le` | Task execution duration distribution. |
| `celery_task_duration_seconds_sum` | seconds | `task`, `queue` | Sum of task execution durations. |
| `celery_task_duration_seconds_count` | count | `task`, `queue` | Number of task execution duration samples. |
| `celery_task_queue_wait_seconds_bucket` | seconds | `task`, `queue`, `le` | Distribution of queue wait time from task publish to execution start. |
| `celery_task_queue_wait_seconds_sum` | seconds | `task`, `queue` | Sum of task queue wait times. |
| `celery_task_queue_wait_seconds_count` | count | `task`, `queue` | Number of task queue wait time samples. |
| `celery_task_failure_exception_total` | count | `task`, `queue`, `exception_type` | Distribution of task failure exception types. |
| `celery_task_timeout_total` | count | `task`, `queue`, `timeout_type` | Number of Celery soft/hard timeout occurrences. |
| `celery_task_retry_total` | count | `task`, `queue`, `exception_type` | Number of task retries. |
| `celery_task_retry_delay_seconds_bucket` | seconds | `task`, `queue`, `le` | Task retry delay distribution. |
| `celery_task_retry_delay_seconds_sum` | seconds | `task`, `queue` | Sum of task retry delays. |
| `celery_task_retry_delay_seconds_count` | count | `task`, `queue` | Number of task retry delay samples. |
Beat and Scheduled Task Metrics
| Metric Field | Unit | Tags | Meaning |
|---|---|---|---|
| `celery_beat_task_last_publish_timestamp_seconds` | Unix seconds | `beat_name`, `task` | Last publish time of the beat entry task. |
| `celery_beat_task_last_started_timestamp_seconds` | Unix seconds | `beat_name`, `task` | Last execution start time of the beat entry's corresponding task. |
| `celery_beat_lag_seconds` | seconds | `beat_name`, `task` | Lag from beat task publish to worker execution start. |
| `celery_beat_publish_interval_seconds` | seconds | `beat_name`, `task` | Actual interval between the last two publishes of the beat entry. |
| `celery_beat_missed` | boolean | `beat_name`, `task` | Whether scheduling is suspected to be missed; 1 indicates suspected missed scheduling. |
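The source does not spell out how `celery_beat_missed` is computed. The sketch below is one plausible reading of the `beatMissedLagThresholdSeconds` and `beatMissedIntervalMultiplier` descriptions, intended only to show how the beat metrics can be combined in a custom check. The function name, parameter names, and both conditions are assumptions, not the Studio implementation.

```python
def beat_suspected_missed(
    now,
    last_publish_ts,
    last_started_ts,
    publish_interval_seconds,
    lag_threshold_seconds=300.0,  # assumed to mirror beatMissedLagThresholdSeconds
    interval_multiplier=2.0,      # assumed to mirror beatMissedIntervalMultiplier
):
    """Return True when a beat entry looks like it missed scheduling."""
    # Condition 1 (assumed): the task was published, but no execution start
    # has been recorded since that publish within the lag threshold.
    not_started = last_started_ts is None or last_started_ts < last_publish_ts
    if not_started and now - last_publish_ts > lag_threshold_seconds:
        return True
    # Condition 2 (assumed): no new publish has happened within
    # multiplier x the most recent publish interval.
    if publish_interval_seconds and (
        now - last_publish_ts > interval_multiplier * publish_interval_seconds
    ):
        return True
    return False
```

Scaling the second condition by the observed publish interval is what keeps low-frequency entries, such as a daily task, from being flagged by a fixed threshold designed for minute-level schedules.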
Business Task Metrics
| Metric Field | Unit | Tags | Meaning |
|---|---|---|---|
| `business_task_runs_total` | count | `domain`, `task`, `result` | Number of business task runs. |
| `business_task_items_total` | count | `domain`, `task`, `item_type`, `result` | Number of objects processed by business tasks. |
| `business_task_duration_seconds_bucket` | seconds | `domain`, `task`, `result`, `le` | End-to-end duration distribution of business tasks. |
| `business_task_duration_seconds_sum` | seconds | `domain`, `task`, `result` | Sum of end-to-end durations of business tasks. |
| `business_task_duration_seconds_count` | count | `domain`, `task`, `result` | Number of end-to-end duration samples for business tasks. |
| `business_task_last_success_timestamp_seconds` | Unix seconds | `domain`, `task` | Last success time of the business task. |
| `business_task_last_failure_timestamp_seconds` | Unix seconds | `domain`, `task`, `exception_type` | Last failure time of the business task. |
| `business_task_partial_failure_total` | count | `domain`, `task`, `reason` | Number of times a task did not fail overall but had partial failures. |
Currently integrated business domains include:
| `domain` | Typical Tasks | Focus Points |
|---|---|---|
| `archive_report` | Archive report v2/v3, first-cycle notification, delayed notification | Whether report triggering, screenshots, and notifications succeed; presence of partial failures. |
| `incidents` | Incident duty policy analysis, incident queue sync, incident notification sending | Whether the incident notification pipeline succeeds; presence of backlog. |
| `billing` | Billing statistics reporting | Whether on time and successful; number of workspaces processed. |
| `workspace_usage` | OpenAPI API Key usage database refresh | Whether usage refresh succeeds; number of buckets and access keys processed. |
| `cleanup` | Dashboard history cleanup, etc. | Whether cleanup tasks are failing long-term or being skipped. |
| `sync_config` | Integration template synchronization | Whether configuration synchronization succeeds. |
| `notification` | Status Page status change notification | Whether notification tasks succeed or fail. |
| `keyevent` | Critical event unresolved asynchronous query | Whether critical event asynchronous queries are abnormal. |
| `cloud_collector` | Cloud collector asynchronous operations | Asynchronous operation splitting, lock waiting, success/failure. |
| `catalog` | Unified catalog entity health | Whether entity health tasks are on time and successful, and whether processing volume is abnormal. |
| `snapshot` | Dashboard screenshot, chart screenshot, chart data generation | Snapshot service screenshot/chart data task results. |
Independent Entry and Dependency Health Metrics
| Metric Field | Unit | Tags | Meaning |
|---|---|---|---|
| `service_entry_events_total` | count | `entry`, `event`, `result` | Number of events for non-Flask entries such as WebSocket and snapshot. |
| `service_entry_active` | count/boolean | `entry`, `state` | Current active state of non-Flask entries. |
| `dependency_db_pool_connections` | count | `pool`, `state` | Current state of the database connection pool in the exporter's process; `state` includes `size`, `checked_in`, `checked_out`, `overflow`. |
| `self_monitor_export_total` | count | `exporter`, `result` | Result of this /metrics export. |
| `self_monitor_export_points_total` | count | `exporter`, `result` | Number of Prometheus samples successfully exported in this /metrics export. |
| `self_monitor_export_duration_seconds` | seconds | `exporter`, `result` | Duration of this /metrics export. |
| `self_monitor_export_last_success_timestamp_seconds` | Unix seconds | `exporter` | Last successful export time. |
| `self_monitor_export_last_failure_timestamp_seconds` | Unix seconds | `exporter`, `exception_type` | Time of the last fail-open export failure. |
| `self_monitor_export_error_total` | count | `exporter`, `exception_type` | Fail-open failure event for this export. |
Asynchronous Task and Redis/Broker Monitoring Recommendations
Customer concerns like "are asynchronous tasks abnormal, is Redis disconnected, are workers stuck" cannot be judged by a single metric. It is recommended to use combined conditions.
| Scenario | Priority Observation Metrics | Recommended Dimensions | Interpretation Method |
|---|---|---|---|
| Worker not consuming or insufficient consumption capacity | `worker_queue_count`, `celery_queue_oldest_wait_seconds`, `celery_task_published_total`, `celery_task_started_total` | `queue`, `task` | Queue length and oldest wait time keep rising while published increases and started stays very low. This usually indicates that workers are not consuming, have insufficient capacity, or have connection issues with the broker. |
| Redis/Broker readable but worker disconnected | `worker_queue_count`, `celery_queue_oldest_wait_seconds`, `celery_task_active` | `queue` | The exporter can read the queue and backlog is rising, but active stays at 0 or significantly low for a long time. Suspect worker-side disconnection, a hang, or workers not started. |
| Redis/Broker completely unavailable or exporter read failure | `self_monitor_export_total`, `self_monitor_export_error_total`, `self_monitor_export_last_failure_timestamp_seconds`, `self_monitor_export_points_total` | `exporter`, `exception_type` | If /metrics fails open, the failure time refreshes, and the sample count drops significantly, the collection pipeline itself may have failed to access Redis, the DB, or a metric source. |
| Task starts but gets stuck and doesn't finish | `celery_task_active`, `celery_task_started_total`, `celery_task_finished_total`, `celery_task_duration_seconds_bucket` | `queue`, `task` | Active doesn't decrease for a long time, started increases but finished doesn't, or P99 duration keeps rising. Tasks may be stuck on external calls, locks, the DB, or loop logic. |
| Task failure or retry storm | `celery_task_finished_total`, `celery_task_failure_exception_total`, `celery_task_retry_total`, `celery_task_retry_delay_seconds_bucket` | `task`, `exception_type` | Failures and retries both rise and exception types are concentrated. Tasks may have entered a failure-retry loop. |
| Beat publishes normally but worker doesn't start | `celery_beat_task_last_publish_timestamp_seconds`, `celery_beat_task_last_started_timestamp_seconds`, `celery_beat_lag_seconds`, `celery_beat_missed` | `beat_name`, `task` | last_publish updates but last_started doesn't, lag rises, or missed=1. The scheduled task was delivered but the worker hasn't started consuming it. |
| Beat stops publishing or low-frequency task misses scheduling | `celery_beat_publish_interval_seconds`, `celery_beat_task_last_publish_timestamp_seconds`, `celery_beat_missed` | `beat_name`, `task` | The publish interval exceeds the historical period or last_publish is too old. Beat may have stopped, the configuration may not be enabled, or the scheduler may be abnormal. |
| Business task overall success but partial object failure | `business_task_partial_failure_total`, `business_task_items_total`, `business_task_runs_total` | `domain`, `task`, `reason`, `item_type` | Partial failures increase while the overall task result may still be partial_success. Look at the failure reasons for the specific business objects. |
| Business task no success for a long time | `business_task_last_success_timestamp_seconds`, `business_task_last_failure_timestamp_seconds`, `business_task_runs_total` | `domain`, `task` | last_success is too far from the current time, and last_failure keeps updating or runs contain no success. This business pipeline may be failing silently. |
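The first and fourth scenarios in the table can be told apart by combining the same few signals. The sketch below is a simplified triage under stated assumptions: the inputs are changes over one observation window for a single queue, the thresholds are plain zero comparisons, and `QueueWindow` and `diagnose` are hypothetical names. Real alert rules would require the condition to hold across several consecutive windows.

```python
from dataclasses import dataclass


@dataclass
class QueueWindow:
    """Changes observed for one queue over one window (assumed inputs)."""
    queue_length_delta: float  # change in worker_queue_count
    oldest_wait_delta: float   # change in celery_queue_oldest_wait_seconds
    published_delta: float     # increase in celery_task_published_total
    started_delta: float       # increase in celery_task_started_total
    finished_delta: float      # increase in celery_task_finished_total
    active: float              # current celery_task_active


def diagnose(window):
    """Classify a window as not consuming, stuck, or ok."""
    backlog_rising = (
        window.queue_length_delta > 0 or window.oldest_wait_delta > 0
    )
    # Tasks are published and the queue grows, but nothing starts.
    if backlog_rising and window.published_delta > 0 and window.started_delta == 0:
        return "worker_not_consuming"
    # Tasks are running but none completes within the window.
    if window.active > 0 and window.finished_delta == 0:
        return "worker_suspected_stuck"
    return "ok"


print(diagnose(QueueWindow(12, 90, 30, 0, 0, 0)))  # → worker_not_consuming
```

A single window with active tasks and no completions can also be a legitimately long-running task, which is why the "stuck" result is only a suspicion until duration percentiles confirm it.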
It is recommended to at least set up the following alerts:
| Alert Item | Suggested Severity | Suggested Condition |
|---|---|---|
| Self-monitoring export failure | P0 | self_monitor_export_total{result="failure"} appears or self_monitor_export_error_total appears. |
| Self-monitoring no success for a long time | P0 | Current time minus self_monitor_export_last_success_timestamp_seconds exceeds 2 to 3 Datakit pull cycles. |
| Celery queue backlog | P0 | worker_queue_count continuously exceeds threshold, or celery_queue_oldest_wait_seconds continuously exceeds business-acceptable wait time. |
| Worker suspected not consuming | P0 | celery_task_published_total increases, but celery_task_started_total shows no growth for a long time, while queue length or oldest wait time rises. |
| Worker suspected stuck | P0 | celery_task_active > 0 for a long time and doesn't decrease, celery_task_finished_total doesn't grow, task duration P99 continuously rises. |
| Beat missed scheduling | P0 | celery_beat_missed=1, or celery_beat_lag_seconds exceeds task acceptable threshold. |
| Celery task failure rate increase | P1 | Proportion of celery_task_finished_total{status!="success"} exceeds threshold for consecutive multiple cycles. |
| Celery retry storm | P1 | celery_task_retry_total increases consecutively, concentrated on the same task or exception_type. |
| Business task no success for a long time | P0/P1 | Critical task hasn't updated business_task_last_success_timestamp_seconds for a long time. |
| DB pool near exhaustion | P1 | dependency_db_pool_connections{state="checked_out"} approaches state="size", or state="overflow" > 0 appears continuously. |
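Two of the conditions in the alert table reduce to simple arithmetic, sketched below. The function names are hypothetical, the default of 3 pull cycles follows the "2 to 3 Datakit pull cycles" suggestion, and the 0.9 usage ratio for "approaches size" is an assumed threshold that should be tuned per environment.

```python
import time


def export_is_stale(last_success_ts, pull_interval_seconds, cycles=3, now=None):
    """True when the last successful export is older than `cycles` pulls."""
    if now is None:
        now = time.time()
    return now - last_success_ts > cycles * pull_interval_seconds


def db_pool_near_exhaustion(size, checked_out, overflow, ratio=0.9):
    """True when checked_out approaches size or overflow is in use.

    ratio=0.9 is an assumed threshold, not a value from the source.
    """
    if overflow > 0:
        return True
    return size > 0 and checked_out / size >= ratio


# With a 60-second pull interval, 200 seconds without success is stale.
print(export_is_stale(last_success_ts=1000, pull_interval_seconds=60, now=1200))
# → True
```

Tying the staleness window to the pull interval, not to a fixed number of seconds, keeps the alert valid when the Datakit collection frequency changes.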
Common DQL Examples
View current backlog per queue:
View oldest task wait time per queue:
View difference between task publish and start execution:
M::`df_studio`:(sum(`celery_task_published_total`), sum(`celery_task_started_total`)) BY `queue`,`task`
View task failure exception TopN:
Check if beat missed scheduling:
View self-monitoring export status:
M::`df_studio`:(max(`self_monitor_export_total`), max(`self_monitor_export_points_total`), max(`self_monitor_export_duration_seconds`)) BY `exporter`,`result`
View business task last success time:
Relationship with Existing Self-Monitoring Documentation
For the complete self-monitoring deployment process for the Deployment Plan, please refer to the document "Enabling Observability for the Deployment Plan Itself" in the same directory. That document covers general steps like DataKit deployment, Prometheus pull configuration, APM, RUM, Synthetic Tests, Monitors, and template import. This document only supplements the df_studio metrics output by the Studio backend itself, configuration switches, tag units, and monitoring criteria for asynchronous tasks/Redis/Broker.