OpenTelemetry Observability¶
Issues to Address in Building Observability¶
- How to link frontend and backend tracing?
- How to associate trace data with corresponding logs and metrics?
- OpenTelemetry has implemented SDKs for different programming languages. Frontend tracing is mainly achieved through `opentelemetry-js`, while the backend has implementations for various languages such as Java, Go, and Python. Each language reports its trace information to the opentelemetry-collector (hereinafter referred to as `otel-collector`).
- Taking Java as an example, opentelemetry-java (hereinafter referred to as the "Agent") is injected into applications via the `-javaagent` option (one common way to attach it is sketched after the MDC notes below). After the application generates trace data, the trace context is placed into the MDC, which passes the `traceId` and `spanId` to the logging framework; the log output then includes the `traceId` and `spanId`.
The Mapped Diagnostic Context (MDC) is "a tool used to distinguish interleaved log outputs from different sources" (log4j MDC documentation). It contains thread-local context information that is later copied into each log event captured by the logging library.
The OTel Java agent injects several pieces of information about the current span into each log record's MDC copy:
- `trace_id` - the current trace ID (same as `Span.current().getSpanContext().getTraceId()`);
- `span_id` - the current span ID (same as `Span.current().getSpanContext().getSpanId()`);
- `trace_flags` - the current trace flags, formatted according to the W3C Trace Context format (same as `Span.current().getSpanContext().getTraceFlags().asHex()`).
These three pieces of information can be included in log statements generated by the logging library by specifying them in the pattern/format.
Tip: For Spring Boot configurations using Logback, you can add MDC to log lines by overriding only `logging.pattern.level`:

`logging.pattern.level = trace_id=%mdc{trace_id} span_id=%mdc{span_id} trace_flags=%mdc{trace_flags} %5p`
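If the project is configured through application.yml rather than application.properties, the same override can be expressed with the equivalent YAML keys (a minimal sketch, same pattern as above):

```yaml
# Equivalent override in application.yml
logging:
  pattern:
    level: "trace_id=%mdc{trace_id} span_id=%mdc{span_id} trace_flags=%mdc{trace_flags} %5p"
```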
This way, any service or tool parsing application logs can correlate traces/spans with log statements.
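To make the setup above concrete, here is a hedged sketch of attaching the Agent to a service in Kubernetes so that its traces reach the otel-collector. The service name, image, agent path, and collector address are assumptions; the environment variables themselves (`JAVA_TOOL_OPTIONS`, `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT`) are standard:

```yaml
# Sketch: attach the OTel Java agent via JAVA_TOOL_OPTIONS and point it at the collector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service                  # hypothetical service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: app
          image: example/order-service:latest            # assumed image
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/otel/opentelemetry-javaagent.jar"  # agent baked into or mounted in the image
            - name: OTEL_SERVICE_NAME
              value: "order-service"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317"         # assumed collector Service address
```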
- OpenTelemetry also supports metric collection, exporting metrics through the otel-collector to corresponding backends such as Prometheus and then displaying them in Grafana. The OTLP exporter supports metric export, and metrics can be associated with logs and traces via the `service.name` tag.
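As an illustration, here is a hedged collector fragment for the metrics path. The scrape endpoint is an assumption, and `resource_to_telemetry_conversion` (an option of the collector's Prometheus exporter) is enabled so that resource attributes such as `service.name` survive as metric labels for correlation:

```yaml
# Sketch: receive OTLP metrics and expose them for Prometheus to scrape.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889              # assumed scrape endpoint
    resource_to_telemetry_conversion:
      enabled: true                     # keep resource attributes (e.g. service.name) as metric labels

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```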
OpenTelemetry's original intention is to unify data formats, which means that, for the foreseeable future, it does not plan to build an observability product of its own. Instead, it serves as a data transit layer, and vendors use OpenTelemetry's data standards to shape their own observability products.
End-to-End Full-Chain Observability Construction with OpenTelemetry¶
Below are three approaches to building end-to-end full-chain observability using OpenTelemetry:
1. Based on Traditional Monitoring Aggregation¶
This approach primarily uses the otel-collector to push logs, metrics, and traces to ELK, Prometheus, and APM/tracing backends such as Jaeger, respectively.
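A minimal sketch of this fan-out in otel-collector configuration; the endpoints and exporter choices are assumptions (the Elasticsearch exporter ships in the collector's contrib distribution, and recent Jaeger releases accept OTLP directly):

```yaml
# Sketch: fan OTLP data out to Jaeger (traces), Prometheus (metrics), and Elasticsearch (logs).
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317            # assumed Jaeger OTLP gRPC endpoint
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889                     # scraped by a Prometheus server
  elasticsearch:
    endpoints: ["http://elasticsearch:9200"]   # assumed Elasticsearch address

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [elasticsearch]
```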
2. Based on the Grafana Stack¶
In recent years, Grafana Labs has moved into the observability domain, building out Grafana Cloud and offering its own set of solutions. Grafana Tempo is an open-source, easy-to-use, and scalable distributed tracing backend; it is cost-effective, requiring only object storage to run, and integrates deeply with Grafana, Prometheus, and Loki. Tempo can receive trace data directly from OpenTelemetry, Loki collects the log data from OpenTelemetry, and Grafana continues to use Prometheus for metric data.
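A hedged collector sketch for this stack; the endpoints are assumptions, the Loki exporter comes from the contrib distribution (newer Loki versions can also ingest OTLP natively), and Tempo is addressed over OTLP gRPC:

```yaml
# Sketch: route traces to Tempo, logs to Loki, and metrics to Prometheus remote write.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/tempo:
    endpoint: tempo:4317                               # assumed Tempo distributor address
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push        # assumed Loki push endpoint
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write      # assumed Prometheus remote-write endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```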
Although these two approaches address the data-format problem, they remain collections of technical tools rather than products; they are essentially Frankenstein-like combinations of open-source components. When a business issue arises, users still have to jump between different tools to analyze it, so logs, metrics, and traces remain poorly integrated, and operational and communication costs for operations and development teams are not reduced. A unified platform for analyzing log, metric, and trace data is therefore crucial. Grafana is working in this direction, but it has not yet fully solved the data-silo problem: different data structures still use different query languages. At present, Grafana can correlate log data with trace data, but trace data cannot be correlated back to log data, so mutual correlation and joint analysis across data types still need work from the Grafana team.
3. Based on Guance - Commercial Observability Product¶
Guance is a unified platform for collecting and managing metrics, logs, APM, RUM, infrastructure, containers, middleware, network performance, and other types of data. Using Guance allows for comprehensive observability of applications, not just log tracing.
DataKit is the gateway for Guance. To send data to Guance, DataKit must be correctly configured. DataKit offers the following advantages:
- In host environments, each host runs a DataKit instance. Data is first sent to the local DataKit, where it is cached and pre-processed before being reported. This avoids network jitter, provides edge processing capabilities, and reduces the data-processing pressure on the backend.
- In Kubernetes environments, DataKit runs as a DaemonSet, so each node has its own instance. By leveraging Kubernetes' local traffic mechanism, data from the pods on a node is first sent to that node's DataKit, which avoids network jitter and also attaches pod and node labels to APM data, making it easier to pinpoint problems in a distributed environment.
DataKit's design philosophy also draws inspiration from OpenTelemetry: it supports the OTLP protocol, so data can either bypass the collector and be sent directly to DataKit, or the collector's exporter can be set to OTLP with DataKit as the endpoint.
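For the second path, a hedged collector sketch; the `datakit` host name and port assume DataKit's OpenTelemetry input is enabled with an OTLP gRPC listener on 4317, so adjust them to match your DataKit configuration:

```yaml
# Sketch: forward all telemetry from the otel-collector to a local DataKit over OTLP.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/datakit:
    endpoint: datakit:4317          # assumed DataKit OTLP gRPC listener
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/datakit]
    metrics:
      receivers: [otlp]
      exporters: [otlp/datakit]
    logs:
      receivers: [otlp]
      exporters: [otlp/datakit]
```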
Solution Comparison¶
| Scenario | Self-hosted Open Source Products | Using Guance |
| --- | --- | --- |
| Building a Cloud-era Monitoring System | Requires at least 3 months of investment from a professional technical team, and that is just the beginning | Ready to use within 30 minutes |
| Cost Investment | Even a simple open-source monitoring product requires hardware investment exceeding $20,000/year, and a cloud-era observability platform requires a fixed investment of at least $100,000/year (estimated based on cloud hardware) | Pay-as-you-go with elastic costs; overall expenses are more than 50% lower than self-hosting open-source products |
| System Maintenance and Management | Requires continuous attention and investment from professional engineers, and mixing multiple open-source products increases management complexity | No maintenance burden, allowing teams to focus on business issues |
| Number of Probes Needed on Servers | Each open-source tool requires its own probe, consuming significant server resources | One probe, fully binary-based, with minimal CPU and memory usage |
| Value Delivered | Depends entirely on the capabilities of the company's engineers and their proficiency with the open-source products | A comprehensive data platform with full observability, enabling engineers to solve problems with data |
| Root Cause Analysis of Performance Issues and Failures | Relies solely on the team's own capabilities | Rapid root-cause identification based on data analysis |
| Security | A mix of various open-source software, testing the comprehensive skills of technical engineers | Comprehensive security scanning and testing, customer-side code open-sourced to users, and timely product updates to ensure security |
| Scalability and Services | Requires building an SRE engineer team | Professional services, equivalent to having an external SRE support team |
| Training and Support | Requires hiring external instructors | Long-term online training and support |