OpenTelemetry Observability¶
Problems to Solve for Building Observability¶
- How to link front-end and back-end traces?
- How to associate logs and metrics with trace data?
- OpenTelemetry provides SDKs for different languages. Front-end tracing is implemented mainly through opentelemetry-js, and the back end has implementations for the relevant languages such as Java, Go, and Python. Trace data from every language is reported uniformly to the opentelemetry-collector (hereinafter "otel-collector").
- Taking Java as an example, opentelemetry-java (hereinafter "the Agent") is attached to the application via the JVM's javaagent mechanism. Once the application generates trace data, the Agent populates the MDC so that the traceId and spanId are handed to the logging framework and every log line is emitted with them.
Mapped Diagnostic Context (MDC) is "a tool used to distinguish interleaved log output from different sources" (log4j MDC documentation). It holds thread-local contextual information that is later copied to each logging event captured by the logging library.
The OTel Java agent injects several pieces of information about the current span into each logging event's MDC copy:

- trace_id - the current trace id (the same as Span.current().getSpanContext().getTraceId());
- span_id - the current span id (the same as Span.current().getSpanContext().getSpanId());
- trace_flags - the current trace flags, formatted according to the W3C trace flags format (the same as Span.current().getSpanContext().getTraceFlags().asHex()).
These three pieces of information can be included in log statements generated by the logging library by specifying them in the pattern/format.
Tip: For Spring Boot applications that use Logback, you can add the MDC fields to the log lines simply by overriding logging.pattern.level:
logging.pattern.level = trace_id=%mdc{trace_id} span_id=%mdc{span_id} trace_flags=%mdc{trace_flags} %5p
This way, any service or tool that parses the application logs can correlate traces/spans with log statements (see the logging sketch after this list).
- OpenTelemetry also supports metric collection: metrics are exported through the otel-collector to the corresponding exporters, such as Prometheus, and then visualized in Grafana. The OTLP exporter supports metric output, and metrics, logs, and traces can be associated with one another through the service.name tag (see the metrics sketch after this list).
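As an illustration, here is a minimal sketch of the logging side (the class, method, and logger names are invented; it assumes SLF4J and the opentelemetry-api are on the classpath and the OTel Java agent is attached):

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TraceAwareLogging {
    private static final Logger log = LoggerFactory.getLogger(TraceAwareLogging.class);

    public void handleOrder(String orderId) {
        // With the OTel Java agent attached, trace_id, span_id and trace_flags have
        // already been copied into this thread's MDC, so this line is emitted with
        // those fields when the pattern above is used.
        log.info("processing order {}", orderId);

        // The same identifiers are available programmatically from the current span.
        SpanContext ctx = Span.current().getSpanContext();
        String traceId = ctx.getTraceId();               // same value as MDC "trace_id"
        String spanId = ctx.getSpanId();                 // same value as MDC "span_id"
        String traceFlags = ctx.getTraceFlags().asHex(); // same value as MDC "trace_flags"
        log.debug("trace_id={} span_id={} trace_flags={}", traceId, spanId, traceFlags);
    }
}
```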
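And a minimal sketch of the metrics side (the meter name, instrument name, and attribute are illustrative; it assumes an SDK, for example the one installed by the Java agent, is registered behind GlobalOpenTelemetry with a service.name resource attribute and an OTLP metrics exporter configured):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class OrderMetrics {
    private static final Meter meter = GlobalOpenTelemetry.getMeter("order-service");

    // Exported through the OTLP exporter, this series carries the SDK's service.name
    // resource attribute, the same tag found on the service's traces and logs, which
    // is what lets the three signals be joined in Prometheus/Grafana or elsewhere.
    private static final LongCounter ordersProcessed = meter
            .counterBuilder("orders.processed")
            .setDescription("Number of processed orders")
            .build();

    public void recordOrder(String status) {
        ordersProcessed.add(1, Attributes.of(AttributeKey.stringKey("status"), status));
    }
}
```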
OpenTelemetry's original goal is to unify data formats, which suggests that for a long time to come it does not plan to build an observability product of its own. In practice, teams either use OpenTelemetry as a data relay or adopt its data standard to shape their own observability products.
Building End-to-End Full-Trace Observability Based on OpenTelemetry¶
The following introduces three approaches to building end-to-end full-trace observability on top of OpenTelemetry:
1. Based on Traditional Monitoring Aggregation¶
The otel-collector pushes logs, metrics, and traces to ELK, Prometheus, and an APM backend such as Jaeger, respectively.
2. Based on Grafana Suite¶
In recent years Grafana Labs has also moved into the observability field with Grafana Cloud, launching its own observability stack. Grafana Tempo is an open-source, easy-to-use, scalable distributed tracing backend. Tempo is cost-effective, requires only object storage to run, and is deeply integrated with Grafana, Prometheus, and Loki. Tempo works with any open-source tracing protocol, including Jaeger, Zipkin, and OpenTelemetry, so it can receive trace data directly from OpenTelemetry; Loki collects the log data, and Grafana still relies on Prometheus for the metric data.
Although the two solutions above solve the data-format problem, they are, in a sense, technologies rather than products: they are stitched together from open-source tools. When a business issue comes up, engineers still have to jump between different tools to investigate and analyze it. Logs, metrics, and traces are not well integrated, so the operational and communication costs for operations and development staff are not reduced. A unified analysis platform for logs, metrics, and traces therefore becomes especially important. Grafana keeps moving in this direction, but it has not fully solved the data-silo problem: different structured data still requires different query languages. Grafana can currently link log data to trace data, but trace data cannot yet be linked back to log data, so the Grafana team still has work to do on cross-signal query and analysis.
3. Based on Guance - Commercial Observability Product¶
Guance is a unified platform for collecting and managing many kinds of data, including metrics, logs, APM, RUM, infrastructure, containers, middleware, and network performance. With Guance we can observe applications comprehensively, rather than only correlating logs and traces.
DataKit is the gateway that sits in front of Guance. To report data to Guance, DataKit must be configured correctly. Using DataKit has the following advantages:
- In a host environment, each host runs a DataKit. Data is first sent to the local DataKit, which caches and preprocesses it before reporting it upstream; this smooths out network fluctuations, provides edge-processing capability, and reduces the load on back-end data processing.
- In a Kubernetes environment, DataKit runs as a DaemonSet on every node. Using Kubernetes' node-local traffic mechanism, data from the pods on each node is first sent to that node's DataKit; this likewise smooths out network fluctuations and adds pod and node tags to APM data, making problems easier to locate in a distributed environment.
DataKit's design also draws on OpenTelemetry and is compatible with the OTLP protocol, so an application can bypass the collector and send data directly to DataKit, or the collector's exporter can be set to otlp with DataKit as the destination.
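For illustration, a minimal sketch of the direct path, assuming the application wires up the SDK itself rather than relying on the agent; the class name is invented and the endpoint is only a placeholder for whatever OTLP gRPC address is enabled in DataKit's opentelemetry input:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class DirectToDatakit {
    public static SdkTracerProvider tracerProvider() {
        // Send spans straight to the node-local DataKit instead of an otel-collector.
        // The endpoint is an assumption: use the OTLP gRPC address configured in
        // DataKit's opentelemetry input.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://127.0.0.1:4317") // placeholder for the local DataKit
                .build();

        // The rest of the SDK wiring is unchanged.
        return SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
    }
}
```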
Solution Comparison¶
| Scenario | Self-built with open-source products | Using Guance |
| --- | --- | --- |
| Building a cloud-era monitoring system | At least 3 months of investment from a professional technical team, and that is only the beginning | Ready to use within 30 minutes |
| Related cost investment | Even a simple open-source monitoring product requires a hardware investment of more than $20,000/year; a cloud-era observability platform needs a fixed investment of at least $100,000/year (estimated on cloud hardware) | Pay-as-you-go; fees scale with actual business volume, and overall costs are more than 50% lower than the total investment required by open-source products |
| System maintenance and management | Requires long-term attention and investment from professional engineers, and mixing multiple open-source products increases management complexity | Nothing to maintain; focus on business issues |
| Number of probes to install on servers | Each open-source tool requires its own probe, consuming a large amount of server performance | One probe, fully binary, with extremely low CPU and memory usage |
| Value delivered | Depends entirely on the ability of the company's own engineers and their research into the open-source products | A comprehensive data platform with full observability that lets engineers use data to solve problems |
| Root-cause analysis of performance issues and failures | Relies solely on the team's own capabilities | Quickly locates issues through data analysis |
| Security | A mix of open-source software that tests the overall skills of the technical engineers | Comprehensive security scanning and testing, customer-side code open-sourced to users, and timely product iterations ensure security |
| Scalability and services | Requires building an in-house SRE team | Professional services included, equivalent to an external SRE support team |
| Training and support | Hire external instructors | Long-term online training and support |