
OpenTelemetry Observability


Problems to Solve for Building Observability

  1. How to link front-end and back-end in tracing?

  2. How to associate logs and metrics with trace data?

  1. Linking front-end and back-end tracing: OpenTelemetry provides SDKs for different languages. Front-end tracing is mainly handled by opentelemetry-js, while the back-end has implementations for the relevant languages such as Java, Go, and Python. Trace data from every language is reported uniformly to opentelemetry-collector (hereinafter referred to as otel-collector).

  2. Associating logs with traces: taking Java as an example, opentelemetry-java (hereinafter referred to as the "Agent") is attached to the application as a javaagent. Once the application produces trace data, the Agent writes the traceId and spanId into the MDC, so that every log line carries the traceId and spanId when it is output.

Mapped Diagnostic Context (MDC) is "a tool used to distinguish interleaved log output from different sources" (log4j MDC documentation). It contains thread-local context information, which is later copied to each log event captured by the logging library.

The OTel Java agent injects several pieces of information about the current span into each log record event's MDC copy:

  • trace_id - the current trace id (the same as Span.current().getSpanContext().getTraceId());
  • span_id - the current span id (the same as Span.current().getSpanContext().getSpanId());
  • trace_flags - the current trace flags, formatted according to the W3C Trace Flags format (the same as Span.current().getSpanContext().getTraceFlags().asHex()).

These three pieces of information can be included in log statements generated by the logging library by specifying them in the pattern/format.

Tip: For Spring Boot configurations using logback, you can add the MDC values to the log lines by overriding logging.pattern.level with the following:

logging.pattern.level = trace_id=%mdc{trace_id} span_id=%mdc{span_id} trace_flags=%mdc{trace_flags} %5p

This way, any service or tool that parses application logs can associate traces/spans with log statements.
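If the Spring Boot application is configured through application.yml instead of a properties file, the same override can be written as below; this is just the YAML form of the exact property shown above, nothing else changes:

```yaml
# application.yml equivalent of the logging.pattern.level override above;
# only the configuration format differs, the pattern string is identical.
logging:
  pattern:
    level: "trace_id=%mdc{trace_id} span_id=%mdc{span_id} trace_flags=%mdc{trace_flags} %5p"
```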

  3. Associating metrics: OpenTelemetry also supports metric collection. Metrics are exported via otel-collector to the corresponding exporters, such as Prometheus, and then displayed in Grafana. The OTLP exporter supports metric output, and metrics, logs, and traces can be associated with one another through the service.name tag.
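As a rough illustration of that association, the sketch below shows an otel-collector metrics pipeline that upserts a shared service.name resource attribute before exporting to Prometheus; the same processor would also be added to the traces and logs pipelines. The resource processor comes from the collector-contrib distribution, and the service name and endpoint are placeholder assumptions, not values from this article.

```yaml
# Minimal sketch: make sure metrics carry the same service.name tag that the
# traces and logs carry, so a backend can correlate the three signals.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  resource:
    attributes:
      - key: service.name
        value: order-service        # hypothetical service name
        action: upsert

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"        # scrape endpoint exposed to Prometheus

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource]
      exporters: [prometheus]
```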

OpenTelemetry's original goal is to unify data formats, which suggests that for a long time to come it does not plan to build observability products of its own. Teams still use OpenTelemetry as a data relay, or adopt its data standard to shape their own observability products.

End-to-End Full Tracing Observability Construction Based on OpenTelemetry

The following introduces three methods for constructing end-to-end full tracing observability based on OpenTelemetry:

1. Based on Traditional Monitoring Aggregation

This approach mainly uses otel-collector to push logs, metrics, and traces to ELK, Prometheus, and APM backends such as Jaeger, respectively.
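A sketch of what this fan-out could look like in an otel-collector configuration follows; component names are taken from the opentelemetry-collector-contrib distribution, Jaeger is addressed through its native OTLP ingest, and every endpoint is a placeholder to be adapted to the actual environment:

```yaml
# Sketch: one collector routing the three signals to separate backends.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  elasticsearch:                         # logs -> ELK
    endpoints: ["http://elasticsearch:9200"]
  prometheusremotewrite:                 # metrics -> Prometheus remote write
    endpoint: "http://prometheus:9090/api/v1/write"
  otlp/jaeger:                           # traces -> Jaeger (OTLP ingest)
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [elasticsearch]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```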

2. Based on Grafana Suite

In recent years, Grafana Labs has also ventured into the observability field, with Grafana Cloud and its own set of observability solutions. Grafana Tempo is an open-source, easy-to-use, and scalable distributed tracing backend. Tempo is cost-effective, requires only object storage to run, and is deeply integrated with Grafana, Prometheus, and Loki. Tempo works with any open-source tracing protocol, including Jaeger, Zipkin, and OpenTelemetry, so it can receive trace data directly from OpenTelemetry, Loki collects the log data, and Grafana still uses Prometheus to receive the metric data.
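Relative to the previous sketch, only the exporter targets need to change to point at the Grafana stack; a rough exporter section is shown below, with the loki exporter taken from collector-contrib and all endpoints again placeholders:

```yaml
# Sketch: exporter targets for the Grafana stack; the pipelines keep the same
# shape as in the previous example, only the destinations change.
exporters:
  otlp/tempo:                  # traces -> Grafana Tempo (accepts OTLP)
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:                        # logs -> Loki
    endpoint: "http://loki:3100/loki/api/v1/push"
  prometheusremotewrite:       # metrics -> Prometheus
    endpoint: "http://prometheus:9090/api/v1/write"
```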

Although the two solutions above solve the data-format problem, in a sense they are technologies rather than products: they are stitched together from open-source tools. When a business issue arises, engineers still have to jump between different tools to investigate and analyze it. Logs, metrics, and traces are not well integrated, so the operational and communication costs for operations and development staff are not reduced, which is why a unified analysis platform for logs, metrics, and traces matters so much. Grafana keeps moving in this direction, but it has not fully solved the data-silo problem: different structured data still requires different query languages, and while Grafana can now associate log data with trace data, trace data cannot yet be associated back to log data. The Grafana team still has work to do on cross-correlated query and analysis of its data.

3. Based on Guance - Commercial Observability Product

Guance is a unified platform for collecting and managing many kinds of data, including metrics, logs, APM, RUM, infrastructure, containers, middleware, network performance, and more. Using Guance lets us observe applications comprehensively, rather than only correlating logs and traces.


DataKit is the gateway that sits in front of Guance. To send data to Guance, DataKit must be configured correctly. Using DataKit has the following advantages:

  1. In a host environment, each host runs one DataKit. Data is first sent to the local DataKit, which caches and preprocesses it before reporting, smoothing out network fluctuations, providing edge-processing capability, and reducing the pressure on back-end data processing.
  2. In a Kubernetes environment, DataKit runs as a DaemonSet so that each node has its own instance. Using Kubernetes' node-local traffic mechanism, data from the pods on a node is first sent to that node's DataKit, which again smooths out network fluctuations and also adds pod and node tags to the APM data, making it easier to locate problems in a distributed environment.

DataKit's design also draws on OpenTelemetry and is compatible with the OTLP protocol, so applications can either bypass the collector and send data directly to DataKit, or keep the collector and set its exporter to OTLP with DataKit as the target.
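For the second option (keeping the collector and forwarding to DataKit over OTLP), the configuration might look roughly like the sketch below. The datakit address and port 4317 are assumptions here and should be checked against the opentelemetry input configuration of the actual DataKit installation:

```yaml
# Sketch: forward all three signals from the collector to a local DataKit
# over OTLP. The DataKit endpoint is an assumption; verify it against the
# datakit opentelemetry input settings on the host or node.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/datakit:
    endpoint: datakit:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/datakit]
    metrics:
      receivers: [otlp]
      exporters: [otlp/datakit]
    logs:
      receivers: [otlp]
      exporters: [otlp/datakit]
```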

Solution Comparison

| Scenario | Open Source Self-built Products | Using Guance |
| --- | --- | --- |
| Building Cloud-era Monitoring Systems | At least 3 months of investment from a professional technical team, and this is just the beginning | Ready to use within 30 minutes |
| Related Cost Investment | Even a simple open-source monitoring product requires a hardware investment exceeding $20,000/year; a cloud-era observability platform means at least $100,000/year in fixed investment (estimated by cloud hardware) | Pay-as-you-go; fees flex with actual business conditions, and overall costs are more than 50% lower than the comprehensive investment of using open-source products |
| System Maintenance Management | Requires long-term attention and investment from professional engineers, and mixing multiple open-source products increases management complexity | No need to worry about it; focus on business issues |
| Number of Probes to Install on Servers | Each open-source product requires its own probe, consuming a large amount of server performance | One probe, fully binary-based, with extremely low CPU and memory usage |
| Value Brought | Depends only on the ability of the company's own engineers and their research into the open-source products | A comprehensive data platform and full observability, enabling engineers to use data to solve problems |
| Root Cause Analysis of Performance and Failures | Relies only on the team's own capabilities | Quickly locates issues based on data analysis |
| Security | A mix of open-source software that tests the all-round skills of the technical engineers | Comprehensive security scanning and testing; customer-side code is open-sourced to users, and timely product iterations ensure security |
| Scalability and Services | Needs to build its own SRE engineering team | Provides professional services, equivalent to an external SRE support team |
| Training and Support | Hire external instructors | Long-term online training support |
