OpenTelemetry Observability¶
Issues to Address in Building Observability¶
- How to link frontend and backend tracing?
- How to associate trace data with corresponding logs and metrics?
- OpenTelemetry has implemented SDKs for different programming languages. Frontend tracing is mainly achieved through `opentelemetry-js`, while the backend has implementations for various languages such as Java, Go, and Python. Each language reports its trace information to the opentelemetry-collector (hereinafter referred to as `otel-collector`).
- Taking Java as an example, opentelemetry-java (hereinafter referred to as the "Agent") is injected into applications via the `-javaagent` option (one common way to attach it is sketched after the MDC notes below). After the application generates trace data, the trace context is placed into the MDC, which passes the `traceId` and `spanId` to the logging framework; the log output then includes the `traceId` and `spanId`.
The Mapped Diagnostic Context (MDC) is "a tool used to distinguish interleaved log outputs from different sources" (log4j MDC documentation). It contains thread-local context information that is later copied into each log event captured by the logging library.
The OTel Java agent injects several pieces of information about the current span into each log record's MDC copy:
- `trace_id` - the current trace ID (same as `Span.current().getSpanContext().getTraceId()`);
- `span_id` - the current span ID (same as `Span.current().getSpanContext().getSpanId()`);
- `trace_flags` - the current trace flags, formatted according to the W3C Trace Context format (same as `Span.current().getSpanContext().getTraceFlags().asHex()`).
These three pieces of information can be included in log statements generated by the logging library by specifying them in the pattern/format.
Tip: For Spring Boot configurations using Logback, you can add MDC to log lines by overriding only `logging.pattern.level`:

`logging.pattern.level = trace_id=%mdc{trace_id} span_id=%mdc{span_id} trace_flags=%mdc{trace_flags} %5p`
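If the project is configured through application.yml rather than application.properties, the same override can be expressed with the equivalent YAML keys (a minimal sketch, same pattern as above):

```yaml
# Equivalent override in application.yml
logging:
  pattern:
    level: "trace_id=%mdc{trace_id} span_id=%mdc{span_id} trace_flags=%mdc{trace_flags} %5p"
```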
This way, any service or tool parsing application logs can correlate traces/spans with log statements.
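To make the setup above concrete, here is a hedged sketch of attaching the Agent to a service in Kubernetes so that its traces reach the otel-collector. The service name, image, agent path, and collector address are assumptions; the environment variables themselves (`JAVA_TOOL_OPTIONS`, `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT`) are standard:

```yaml
# Sketch: attach the OTel Java agent via JAVA_TOOL_OPTIONS and point it at the collector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service                  # hypothetical service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: app
          image: example/order-service:latest            # assumed image
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/otel/opentelemetry-javaagent.jar"  # agent baked into or mounted in the image
            - name: OTEL_SERVICE_NAME
              value: "order-service"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317"         # assumed collector Service address
```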
- OpenTelemetry also supports metric collection, exporting metrics through the otel-collector to corresponding backends such as Prometheus and then displaying them in Grafana. The OTLP exporter supports metric export, and metrics can be associated with logs and traces via the `service.name` tag.
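As an illustration, here is a hedged collector fragment for the metrics path. The scrape endpoint is an assumption, and `resource_to_telemetry_conversion` (an option of the collector's Prometheus exporter) is enabled so that resource attributes such as `service.name` survive as metric labels for correlation:

```yaml
# Sketch: receive OTLP metrics and expose them for Prometheus to scrape.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889              # assumed scrape endpoint
    resource_to_telemetry_conversion:
      enabled: true                     # keep resource attributes (e.g. service.name) as metric labels

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```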
OpenTelemetry's original intention is to unify data formats, which means that, for the foreseeable future, it does not plan to build an observability product of its own. Instead, it serves as a data transit layer, and vendors use OpenTelemetry's data standards to shape their own observability products.
End-to-End Full-Chain Observability Construction with OpenTelemetry¶
Below are three approaches to building end-to-end full-chain observability using OpenTelemetry:
1. Based on Traditional Monitoring Aggregation¶
This approach primarily uses the otel-collector to push logs, metrics, and traces to ELK, Prometheus, and APM/tracing backends such as Jaeger, respectively.
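A minimal sketch of this fan-out in otel-collector configuration; the endpoints and exporter choices are assumptions (the Elasticsearch exporter ships in the collector's contrib distribution, and recent Jaeger releases accept OTLP directly):

```yaml
# Sketch: fan OTLP data out to Jaeger (traces), Prometheus (metrics), and Elasticsearch (logs).
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317            # assumed Jaeger OTLP gRPC endpoint
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889                     # scraped by a Prometheus server
  elasticsearch:
    endpoints: ["http://elasticsearch:9200"]   # assumed Elasticsearch address

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [elasticsearch]
```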
2. Based on the Grafana Stack¶
In recent years, Grafana Labs has moved into the observability domain, building out Grafana Cloud and offering its own set of solutions. Grafana Tempo is an open-source, easy-to-use, and scalable distributed tracing backend; it is cost-effective, requiring only object storage to run, and integrates deeply with Grafana, Prometheus, and Loki. Tempo can receive trace data directly from OpenTelemetry, Loki collects the log data from OpenTelemetry, and Grafana continues to use Prometheus for metric data.
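A hedged collector sketch for this stack; the endpoints are assumptions, the Loki exporter comes from the contrib distribution (newer Loki versions can also ingest OTLP natively), and Tempo is addressed over OTLP gRPC:

```yaml
# Sketch: route traces to Tempo, logs to Loki, and metrics to Prometheus remote write.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/tempo:
    endpoint: tempo:4317                               # assumed Tempo distributor address
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push        # assumed Loki push endpoint
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write      # assumed Prometheus remote-write endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```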
Although these two approaches address the data-format problem, they remain collections of technical tools rather than products; they are essentially Frankenstein-like combinations of open-source components. When a business issue arises, users still have to jump between different tools to analyze it, so logs, metrics, and traces remain poorly integrated, and operational and communication costs for operations and development teams are not reduced. A unified platform for analyzing log, metric, and trace data is therefore crucial. Grafana is working in this direction, but it has not yet fully solved the data-silo problem: different data structures still use different query languages. At present, Grafana can correlate log data with trace data, but trace data cannot be correlated back to log data, so mutual correlation and joint analysis across data types still need work from the Grafana team.
3. Based on Guance - Commercial Observability Product¶
Guance is a unified platform for collecting and managing metrics, logs, APM, RUM, infrastructure, containers, middleware, network performance, and other types of data. Using Guance allows for comprehensive observability of applications, not just log tracing.
DataKit is the gateway for Guance. To send data to Guance, DataKit must be correctly configured. DataKit offers the following advantages:
- In host environments, each host runs a DataKit instance. Data is first sent to the local DataKit, where it is cached and pre-processed before being reported. This avoids network jitter, provides edge processing capabilities, and reduces the data-processing pressure on the backend.
- In Kubernetes environments, DataKit runs as a DaemonSet, so each node has its own instance. By leveraging Kubernetes' local traffic mechanism, data from the pods on a node is first sent to that node's DataKit, which avoids network jitter and also attaches pod and node labels to APM data, making it easier to pinpoint problems in a distributed environment.
DataKit's design philosophy also draws inspiration from OpenTelemetry: it supports the OTLP protocol, so data can either bypass the collector and be sent directly to DataKit, or the collector's exporter can be set to OTLP with DataKit as the endpoint.
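For the second path, a hedged collector sketch; the `datakit` host name and port assume DataKit's OpenTelemetry input is enabled with an OTLP gRPC listener on 4317, so adjust them to match your DataKit configuration:

```yaml
# Sketch: forward all telemetry from the otel-collector to a local DataKit over OTLP.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/datakit:
    endpoint: datakit:4317          # assumed DataKit OTLP gRPC listener
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/datakit]
    metrics:
      receivers: [otlp]
      exporters: [otlp/datakit]
    logs:
      receivers: [otlp]
      exporters: [otlp/datakit]
```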
Solution Comparison¶
| Scenario | Self-hosted Open Source Products | Using Guance |
| --- | --- | --- |
| Building a Cloud-era Monitoring System | Requires at least 3 months of investment from a professional technical team, and that is just the beginning | Ready to use within 30 minutes |
| Cost Investment | Even a simple open-source monitoring product requires hardware investment exceeding $20,000/year, and a cloud-era observability platform requires a fixed investment of at least $100,000/year (estimated based on cloud hardware) | Pay-as-you-go with elastic costs; overall expenses are more than 50% lower than self-hosting open-source products |
| System Maintenance and Management | Requires continuous attention and investment from professional engineers, and mixing multiple open-source products increases management complexity | No maintenance burden, allowing teams to focus on business issues |
| Number of Probes Needed on Servers | Each open-source tool requires its own probe, consuming significant server resources | One probe, fully binary-based, with minimal CPU and memory usage |
| Value Delivered | Depends entirely on the capabilities of the company's engineers and their proficiency with the open-source products | A comprehensive data platform with full observability, enabling engineers to solve problems with data |
| Root Cause Analysis of Performance Issues and Failures | Relies solely on the team's own capabilities | Rapid root-cause identification based on data analysis |
| Security | A mix of various open-source software, testing the comprehensive skills of technical engineers | Comprehensive security scanning and testing, customer-side code open-sourced to users, and timely product updates to ensure security |
| Scalability and Services | Requires building an SRE engineer team | Professional services, equivalent to having an external SRE support team |
| Training and Support | Requires hiring external instructors | Long-term online training and support |