
OpenTelemetry


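OpenTelemetry (hereafter OTEL) is a CNCF observability project that aims to provide a standardized solution for the observability field, covering the data model, collection, processing, and export of observability data.

OTEL is a collection of standards and tools for managing observability data such as traces, metrics, and logs.

This document describes how to configure and enable OTEL data ingestion on DataKit, along with best practices for Java and Go.

Configuration

Go to the conf.d/opentelemetry directory under the DataKit installation directory, copy opentelemetry.conf.sample, and rename the copy to opentelemetry.conf. An example follows: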

[[inputs.opentelemetry]]
  ## customer_tags works as a whitelist of the custom tags that are sent to the data center.
  ## Every "." in a tag name is replaced with "_", for example:
  ## "project.name" is sent to the GuanCe center as "project_name".
  # customer_tags = ["sink_project", "custom.otel.tag"]

  ## Keep rare tracing resources list switch.
  ## If some resources are rare enough (not present in 1 hour), those resources will always be sent
  ## to the data center regardless of samplers and filters.
  # keep_rare_resource = false

  ## By default every error present in a span will be sent to the data center, bypassing any filters or
  ## samplers. To exclude some error statuses, list them here.
  # omit_err_status = ["404"]

  ## compatible_ddtrace: make OTEL traces compatible with DDTrace traces.
  # compatible_ddtrace=false

  ## Split service.name from xx.system.
  ## see: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/database/database-spans.md
  spilt_service_name = true

  ## delete trace message
  # del_message = true

  ## logging message data max length,default is 500kb
  log_max = 500

  ## Ignore tracing resources map like service:[resources...].
  ## The service name is the full service name in current application.
  ## The resource list is regular expressions uses to block resource names.
  ## If you want to block some resources universally under all services, you can set the
  ## service name as "*". Note: double quotes "" cannot be omitted.
  # [inputs.opentelemetry.close_resource]
    # service1 = ["resource1", "resource2", ...]
    # service2 = ["resource1", "resource2", ...]
    # "*" = ["close_resource_under_all_services"]
    # ...

  ## Sampler config uses to set global sampling strategy.
  ## sampling_rate used to set global sampling rate.
  # [inputs.opentelemetry.sampler]
    # sampling_rate = 1.0

  # [inputs.opentelemetry.tags]
    # key1 = "value1"
    # key2 = "value2"
    # ...

  ## Threads config controls how many goroutines an agent could start to handle HTTP requests.
  ## buffer is the size of the worker channel's job buffer.
  ## threads is the total number of goroutines at running time.
  # [inputs.opentelemetry.threads]
    # buffer = 100
    # threads = 8

  ## Storage config sets up a local storage space on the hard drive to cache trace data.
  ## path is the local file path used to cache data.
  ## capacity is total space size(MB) used to store data.
  # [inputs.opentelemetry.storage]
    # path = "./otel_storage"
    # capacity = 5120

  ## OTEL agent HTTP config for trace and metrics
  ## If enable set to be true, trace and metrics will be received on path respectively, by default is:
  ## trace : /otel/v1/traces
  ## metric: /otel/v1/metrics
  ## and the client side should be configured properly with Datakit listening port(default: 9529)
  ## or custom HTTP request path.
  ## for example http://127.0.0.1:9529/otel/v1/traces
  ## The acceptable http_status_ok values will be 200 or 202.
  [inputs.opentelemetry.http]
   http_status_ok = 200
   trace_api = "/otel/v1/traces"
   metric_api = "/otel/v1/metrics"
   logs_api = "/otel/v1/logs"

  ## OTEL agent GRPC config for trace and metrics.
  ## GRPC services for trace and metrics can be enabled respectively as setting either to be true.
  ## addr is the listening address of the gRPC server.
  [inputs.opentelemetry.grpc]
   addr = "127.0.0.1:4317"

  ## If 'expected_headers' is configured, the client must send the listed HTTP headers,
  ## otherwise HTTP status code 400 (bad request) is returned.
  ## Note: once set, expected_headers applies to both traces and metrics.
  # [inputs.opentelemetry.expected_headers]
  # ex_version = "1.2.3"
  # ex_name = "env_resource_name"
  # ...

After configuration, restart DataKit.
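To make the close_resource option in the sample configuration more concrete, here is a minimal Python sketch of how a map from service name to regular expressions could be evaluated. The function name is_closed, the example patterns, and the use of re.search are assumptions for illustration; DataKit's actual matching logic may differ, for example in how patterns are anchored.

```python
import re

# Hypothetical rule set mirroring [inputs.opentelemetry.close_resource]:
# service name -> list of regular expressions; "*" applies to every service.
CLOSE_RESOURCE = {
    "service1": [r"^/healthz$", r"^/metrics"],
    "*": [r"^/internal/"],
}

def is_closed(service: str, resource: str, rules: dict = CLOSE_RESOURCE) -> bool:
    """Return True if the resource should be dropped (assumed semantics)."""
    # Rules for the specific service are checked together with the "*" rules.
    patterns = rules.get(service, []) + rules.get("*", [])
    return any(re.search(p, resource) for p in patterns)

print(is_closed("service1", "/healthz"))      # True
print(is_closed("service2", "/internal/gc"))  # True
print(is_closed("service2", "/healthz"))      # False
```

Keeping the "*" rules in the same map as the per-service rules follows the layout of the configuration file, where the wildcard is simply another quoted key.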

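The collector can be enabled by injecting its configuration through a ConfigMap or by setting ENV_DATAKIT_INPUTS.

Configuration parameters can also be changed through environment variables (the collector must be added to ENV_DEFAULT_ENABLED_INPUTS as a default collector):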

  • ENV_INPUT_OTEL_CUSTOMER_TAGS

    Tag whitelist

    Field type: JSON

    Collector config field: customer_tags

    Example: ["project_id", "custom.tag"]

  • ENV_INPUT_OTEL_KEEP_RARE_RESOURCE

    Keep rare tracing resources list

    Field type: Boolean

    Collector config field: keep_rare_resource

    Default: false

  • ENV_INPUT_OTEL_COMPATIBLE_DD_TRACE

    Convert trace_id to decimal for compatibility with DDTrace

    Field type: Boolean

    Collector config field: compatible_dd_trace

    Default: false

  • ENV_INPUT_OTEL_SPILT_SERVICE_NAME

    Take xx.system from span.Attributes and use it as the service name

    Field type: Boolean

    Collector config field: spilt_service_name

    Default: false

  • ENV_INPUT_OTEL_DEL_MESSAGE

    Delete the trace message

    Field type: Boolean

    Collector config field: del_message

    Default: false

  • ENV_INPUT_OTEL_OMIT_ERR_STATUS

    Error status whitelist

    Field type: JSON

    Collector config field: omit_err_status

    Example: ["404", "403", "400"]

  • ENV_INPUT_OTEL_CLOSE_RESOURCE

    Ignore tracing of the specified services (regular-expression match)

    Field type: JSON

    Collector config field: close_resource

    Example: {"service1":["resource1","other"],"service2":["resource2","other"]}

  • ENV_INPUT_OTEL_SAMPLER

    Global sampling rate

    Field type: Float

    Collector config field: sampler

    Example: 0.3

  • ENV_INPUT_OTEL_THREADS

    Number of threads and buffer size

    Field type: JSON

    Collector config field: threads

    Example: {"buffer":1000, "threads":100}

  • ENV_INPUT_OTEL_STORAGE

    Local cache path and size (MB)

    Field type: JSON

    Collector config field: storage

    Example: {"storage":"./otel_storage", "capacity": 5120}

  • ENV_INPUT_OTEL_HTTP

    Agent HTTP configuration

    Field type: JSON

    Collector config field: http

    Example: {"enable":true, "http_status_ok": 200, "trace_api": "/otel/v1/traces", "metric_api": "/otel/v1/metrics"}

  • ENV_INPUT_OTEL_GRPC

    Agent gRPC configuration

    Field type: JSON

    Collector config field: grpc

    Example: {"trace_enable": true, "metric_enable": true, "addr": "127.0.0.1:4317"}

  • ENV_INPUT_OTEL_EXPECTED_HEADERS

    HTTP headers expected from the client

    Field type: JSON

    Collector config field: expected_headers

    Example: {"ex_version": "1.2.3", "ex_name": "env_resource_name"}

  • ENV_INPUT_OTEL_TAGS

    Custom tags. If the configuration file has a tag with the same name, it is overwritten.

    Field type: JSON

    Collector config field: tags

    Example: {"k1":"v1", "k2":"v2", "k3":"v3"}
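Several of the variables above take JSON values, which are easy to get wrong when quoted by hand. The following Python sketch builds a few of them with json.dumps. The helper name otel_env and the chosen values are illustrative assumptions; only the variable names and JSON shapes come from the list above.

```python
import json

def otel_env(threads: int, buffer: int, tags: dict, omit_err_status: list) -> dict:
    """Build a few ENV_INPUT_OTEL_* variables with JSON-encoded values."""
    return {
        "ENV_INPUT_OTEL_THREADS": json.dumps({"buffer": buffer, "threads": threads}),
        "ENV_INPUT_OTEL_TAGS": json.dumps(tags),
        "ENV_INPUT_OTEL_OMIT_ERR_STATUS": json.dumps(omit_err_status),
        # Float-typed variable: passed as a plain number string, not JSON.
        "ENV_INPUT_OTEL_SAMPLER": "0.3",
    }

env = otel_env(threads=100, buffer=1000, tags={"k1": "v1"}, omit_err_status=["404", "403"])
for name, value in env.items():
    print(f"{name}={value}")
```

Generating the values programmatically avoids stray escape characters when the strings are later embedded in a Kubernetes manifest or shell script.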

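Notes

  1. gRPC is the recommended protocol: it offers a high compression ratio, fast serialization, and better efficiency.
  2. Starting with DataKit 1.10.0, the HTTP routes are configurable. The default request paths for traces, logs, and metrics are /otel/v1/traces, /otel/v1/logs, and /otel/v1/metrics respectively.
  3. float/double values are kept to at most two decimal places.
  4. Both HTTP and gRPC support gzip compression. It is off by default; enable it in the exporter by setting the environment variable OTEL_EXPORTER_OTLP_COMPRESSION=gzip.
  5. HTTP requests support both JSON and Protobuf serialization, whereas gRPC supports Protobuf only.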
Warning
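  • In DDTrace trace data, the service name is derived from the service name or the referenced third-party library, whereas the OTEL collector takes the service name from otel.service.name.
  • To display these services separately, a configuration field was added: spilt_service_name = true
  • The service name is then taken from the tags of the trace data. For example, a DB span tagged db.system=mysql gets the service name mysql, and a message-queue span tagged messaging.system=kafka gets the service name kafka.
  • By default the name is taken from these three tags: db.system, rpc.system, and messaging.system.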
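The effect of spilt_service_name can be sketched in a few lines of Python. The function name split_service_name and the lookup order of the three keys are assumptions made for illustration; the text above only names the keys, not their precedence.

```python
# Attribute keys that may supply a more specific service name (from the list above).
SYSTEM_KEYS = ("db.system", "rpc.system", "messaging.system")

def split_service_name(service: str, attributes: dict) -> str:
    """Return the xx.system value if one is present, else the original service name.

    The lookup order is an assumption; DataKit may resolve conflicts differently.
    """
    for key in SYSTEM_KEYS:
        value = attributes.get(key)
        if value:
            return value
    return service

print(split_service_name("app", {"db.system": "mysql"}))         # mysql
print(split_service_name("app", {"messaging.system": "kafka"}))  # kafka
print(split_service_name("app", {"http.method": "GET"}))         # app
```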

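When using the OTEL HTTP exporter, pay attention to the environment variable configuration. DataKit's default paths are /otel/v1/traces, /otel/v1/logs, and /otel/v1/metrics, so with the HTTP protocol the trace and metric endpoints must each be configured separately.

Agent V2

In V2 the default protocol of the OTLP exporter changed from grpc to http/protobuf. Set -Dotel.exporter.otlp.protocol=grpc to use gRPC, or keep the default http/protobuf.

With HTTP, the path of each exporter must be configured explicitly, for example: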

java -javaagent:/usr/local/ddtrace/opentelemetry-javaagent-2.5.0.jar \
  -Dotel.exporter=otlp \
  -Dotel.exporter.otlp.protocol=http/protobuf \
  -Dotel.exporter.otlp.logs.endpoint=http://localhost:9529/otel/v1/logs \
  -Dotel.exporter.otlp.traces.endpoint=http://localhost:9529/otel/v1/traces \
  -Dotel.exporter.otlp.metrics.endpoint=http://localhost:9529/otel/v1/metrics \
  -Dotel.service.name=app \
  -jar app.jar

To use gRPC, the protocol must be configured explicitly; otherwise the default HTTP protocol is used:

java -javaagent:/usr/local/ddtrace/opentelemetry-javaagent-2.5.0.jar \
  -Dotel.exporter=otlp \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -Dotel.service.name=app \
  -jar app.jar
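The difference between the two setups above (one shared endpoint for gRPC, one endpoint per signal for HTTP) can be captured in a small helper. This is only a sketch: the function name otlp_flags is hypothetical, and the ports 4317 and 9529 are the DataKit defaults shown in this document.

```python
def otlp_flags(protocol: str, host: str = "localhost") -> list:
    """Return the -D flags for the given OTLP protocol (illustrative helper)."""
    flags = [f"-Dotel.exporter.otlp.protocol={protocol}"]
    if protocol == "grpc":
        # gRPC uses a single endpoint on DataKit's gRPC port.
        flags.append(f"-Dotel.exporter.otlp.endpoint=http://{host}:4317")
    elif protocol in ("http/protobuf", "http/json"):
        # HTTP needs one explicit endpoint per signal on DataKit's HTTP port.
        base = f"http://{host}:9529/otel/v1"
        for signal in ("traces", "metrics", "logs"):
            flags.append(f"-Dotel.exporter.otlp.{signal}.endpoint={base}/{signal}")
    else:
        raise ValueError(f"unsupported protocol: {protocol}")
    return flags

print(" \\\n  ".join(otlp_flags("http/protobuf")))
```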

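Log collection is enabled by default. To turn it off, set the logs exporter to none: -Dotel.logs.exporter=none

For more on the breaking changes in V2, see the official documentation or the GuanCe release notes on GitHub: Github-GuanCe-v2.11.0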

Common options

ENV | Command | Description | Default | Notes
OTEL_SDK_DISABLED | otel.sdk.disabled | Disable the SDK | false | When disabled, no trace or metric data is produced
OTEL_RESOURCE_ATTRIBUTES | otel.resource.attributes | "service.name=App,username=liu" | | Every span carries these tags
OTEL_SERVICE_NAME | otel.service.name | Service name, equivalent to "service.name=App" above | | Takes precedence over the above
OTEL_LOG_LEVEL | otel.log.level | Log level | info |
OTEL_PROPAGATORS | otel.propagators | Propagation protocol | tracecontext,baggage |
OTEL_TRACES_SAMPLER | otel.traces.sampler | Sampler | parentbased_always_on |
OTEL_TRACES_SAMPLER_ARG | otel.traces.sampler.arg | Argument for the sampler above | 1.0 | 0 - 1.0
OTEL_EXPORTER_OTLP_PROTOCOL | otel.exporter.otlp.protocol | Protocol, one of: grpc, http/protobuf, http/json | gRPC |
OTEL_EXPORTER_OTLP_ENDPOINT | otel.exporter.otlp.endpoint | OTLP endpoint | http://localhost:4317 | http://datakit-endpoint:9529/otel/v1/traces
OTEL_TRACES_EXPORTER | otel.traces.exporter | Trace exporter | otlp |
OTEL_LOGS_EXPORTER | otel.logs.exporter | Log exporter | otlp | OTEL V1 requires explicit configuration; otherwise it is off by default

Pass otel.javaagent.debug=true to the Agent to view debug logs. These logs are quite verbose, so use this option with caution in production.

Tracing

A trace is made up of multiple spans. Whether for a single service or a cluster of services, a trace gives the complete path, across all services involved, that a request takes from start to finish.

DataKit only accepts OTLP data. OTLP has three types: gRPC, http/protobuf, and http/json. Configure them as follows:

# gRPC (the default before Agent V2)
-Dotel.exporter=otlp \
-Dotel.exporter.otlp.protocol=grpc \
-Dotel.exporter.otlp.endpoint=http://datakit-endpoint:4317

# http/protobuf
-Dotel.exporter=otlp \
-Dotel.exporter.otlp.protocol=http/protobuf \
-Dotel.exporter.otlp.traces.endpoint=http://datakit-endpoint:9529/otel/v1/traces \
-Dotel.exporter.otlp.metrics.endpoint=http://datakit-endpoint:9529/otel/v1/metrics

# http/json
-Dotel.exporter=otlp \
-Dotel.exporter.otlp.protocol=http/json \
-Dotel.exporter.otlp.traces.endpoint=http://datakit-endpoint:9529/otel/v1/traces \
-Dotel.exporter.otlp.metrics.endpoint=http://datakit-endpoint:9529/otel/v1/metrics

Trace sampling

Either head-based or tail-based sampling can be used. For details, see the two best-practice articles.

Tag

Starting with DataKit 1.22.0, the blacklist feature is deprecated. A fixed tag list was added instead: only attributes in this list are extracted into first-level tags. The fixed list is:

Attributes tag Description
http.url http_url Full HTTP request URL
http.hostname http_hostname hostname
http.route http_route Route
http.status_code http_status_code Status code
http.request.method http_request_method Request method
http.method http_method Same as above
http.client_ip http_client_ip Client IP
http.scheme http_scheme Request scheme
url.full url_full Full request URL
url.scheme url_scheme Request scheme
url.path url_path Request path
url.query url_query Request query parameters
span_kind span_kind Span kind
db.system db_system Database system
db.operation db_operation DB operation
db.name db_name Database name
db.statement db_statement Statement details
server.address server_address Server address
net.host.name net_host_name Requested host
server.port server_port Server port
net.host.port net_host_port Same as above
network.peer.address network_peer_address Network address
network.peer.port network_peer_port Network port
network.transport network_transport Transport protocol
messaging.system messaging_system Message queue name
messaging.operation messaging_operation Message operation
messaging.message messaging_message Message
messaging.destination messaging_destination Message details
rpc.service rpc_service RPC service address
rpc.system rpc_system RPC service name
error error Whether an error occurred
error.message error_message Error message
error.stack error_stack Stack trace
error.type error_type Error type
error.msg error_message Error message
project project project
version version Version
env env Environment
host host The host tag in Attributes
pod_name pod_name The pod_name tag in Attributes
pod_namespace pod_namespace The pod_namespace tag in Attributes

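To add custom tags, set them as resource attributes, for example:

# Add custom tags through a startup parameter
-Dotel.resource.attributes=username=myName,env=1.1.0

Then add them to the whitelist in the configuration file, so that the custom tags appear among the first-level tags of the trace details in GuanCe.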

customer_tags = ["sink_project", "username","env"]
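As a rough illustration of the whitelist behavior, the Python sketch below keeps only the attributes named in customer_tags and replaces "." with "_" in their names, as described in the configuration comments. The function name extract_custom_tags and the sample attributes are hypothetical; this is not DataKit's actual implementation.

```python
def extract_custom_tags(attributes: dict, customer_tags: list) -> dict:
    """Keep only whitelisted attributes and replace "." with "_" in their names."""
    allowed = set(customer_tags)
    return {
        key.replace(".", "_"): value
        for key, value in attributes.items()
        if key in allowed
    }

# Hypothetical span attributes; only whitelisted keys survive.
attrs = {"username": "myName", "project.name": "demo", "internal.debug": "1"}
print(extract_custom_tags(attrs, ["sink_project", "username", "project.name"]))
# {'username': 'myName', 'project_name': 'demo'}
```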

Kind

Every span has a span_kind tag, which takes one of six values:

  • unspecified: not set.
  • internal: an internal span or child span.
  • server: a web service, RPC service, and so on.
  • client: a client.
  • producer: a message producer.
  • consumer: a message consumer.

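Metrics

The OpenTelemetry Java Agent obtains MBean metrics from the application over JMX and reports the selected JMX metrics through its internal SDK, which means all of these metrics are configurable.

JMX metric collection can be turned on or off with otel.jmx.enabled=true/false; it is on by default.

To control the interval between MBean detection attempts, use otel.jmx.discovery.delay, which defines the number of milliseconds between the first and the next detection cycle.

The Agent also ships with built-in collection configurations for some third-party software. For details, see: GitHub OTEL JMX Metric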

Warning

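Starting with DataKit 1.68.0, the measurement name changed: all metrics sent to GuanCe now share a single measurement name, otel_service. If you already have dashboards, export them, replace every otel-service with otel_service, and import them again.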

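Some Histogram metrics receive special handling when converted for GuanCe:

  • OpenTelemetry histogram buckets are mapped directly to Prometheus histogram buckets.
  • The count of each bucket is converted to the Prometheus cumulative count format.
  • For example, the OpenTelemetry buckets [0, 10), [10, 50), and [50, 100) are converted to Prometheus _bucket metrics with an le tag:
  my_histogram_bucket{le="10"} 100
  my_histogram_bucket{le="50"} 200
  my_histogram_bucket{le="100"} 250
  • The total number of observations in an OpenTelemetry histogram is converted to the Prometheus _count metric.
  • The sum of an OpenTelemetry histogram is converted to the Prometheus _sum metric; _max and _min metrics are added as well.
  my_histogram_count 250
  my_histogram_max 100
  my_histogram_min 50
  my_histogram_sum 12345.67

Every metric ending in _bucket is histogram data and always has companion metrics ending in _max, _min, _count, and _sum.

Histogram data can be grouped and filtered by the le (less or equal) tag. See OpenTelemetry Metrics for all metrics and tags.

This conversion lets histogram data collected by OpenTelemetry integrate seamlessly with Prometheus, so that Prometheus's query and visualization capabilities can be used for analysis.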
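The cumulative-count step of the conversion above can be sketched in Python. The function name to_prometheus_buckets is hypothetical, and the per-bucket counts 100, 100, and 50 are an assumption chosen so that the cumulative values match the my_histogram example.

```python
from itertools import accumulate

def to_prometheus_buckets(name: str, upper_bounds: list, counts: list) -> list:
    """Convert per-bucket counts into cumulative <name>_bucket{le="..."} lines."""
    lines = []
    # accumulate() yields the running total, which is the Prometheus cumulative count.
    for bound, total in zip(upper_bounds, accumulate(counts)):
        lines.append(f'{name}_bucket{{le="{bound}"}} {total}')
    return lines

# Buckets [0, 10), [10, 50), [50, 100) holding 100, 100 and 50 observations (assumed).
for line in to_prometheus_buckets("my_histogram", [10, 50, 100], [100, 100, 50]):
    print(line)
# my_histogram_bucket{le="10"} 100
# my_histogram_bucket{le="50"} 200
# my_histogram_bucket{le="100"} 250
```

The last cumulative value equals the total observation count, which is why my_histogram_count is also 250 in the example.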

Data field description

otel_service

OpenTelemetry JVM Metrics

  • Tags
Tag Description
action GC Action
area Heap or not
cause GC Cause
container_id Container ID
db_host DB host name: ip or domain name
db_name Database name
db_system Database system name:mysql,oracle...
direction received or sent
exception Exception Information
gc GC Type
host Host Name
host_arch Host arch
host_name Host Name
http.scheme HTTP/HTTPS
http_method HTTP Method
http_request_method HTTP Method
http_response_status_code HTTP status code
http_route HTTP Route
id JVM Type
instrumentation_name Metric Name
jvm_gc_action action:end of major,end of minor GC
jvm_gc_name name:PS MarkSweep,PS Scavenge
jvm_memory_pool_name pool_name:code cache,PS Eden Space,PS Old Gen,MetaSpace...
jvm_memory_type memory type:heap,non_heap
jvm_thread_state Thread state:runnable,timed_waiting,waiting
le *_bucket: histogram metric explicit bounds
level Log Level
main-application-class Main Entry Point
method HTTP Type
name Thread Pool Name
net_protocol_name Net Protocol Name
net_protocol_version Net Protocol Version
os_type OS Type
outcome HTTP Outcome
path Disk Path
pool JVM Pool Type
scope_name Scope name
service_name Service Name
spanProcessorType Span Processor Type
state Thread State:idle,used
status HTTP Status Code
type Kafka broker type
unit metrics unit
uri HTTP Request URI
  • Metrics
Metric Description
application.ready.time Time taken (ms) for the application to be ready to service requests
Type: float
Unit: timeStamp,msec
application.started.time Time taken (ms) to start the application
Type: float
Unit: timeStamp,msec
disk.free Usable space for path
Type: float
Unit: digital,B
disk.total Total space for path
Type: float
Unit: digital,B
executor.active The approximate number of threads that are actively executing tasks
Type: float
Unit: count
executor.completed The approximate total number of tasks that have completed execution
Type: float
Unit: count
executor.pool.core The core number of threads for the pool
Type: float
Unit: digital,B
executor.pool.max The maximum allowed number of threads in the pool
Type: float
Unit: count
executor.pool.size The current number of threads in the pool
Type: float
Unit: digital,B
executor.queue.remaining The number of additional elements that this queue can ideally accept without blocking
Type: float
Unit: count
executor.queued The approximate number of tasks that are queued for execution
Type: float
Unit: count
http.server.active_requests The number of concurrent HTTP requests that are currently in-flight
Type: float
Unit: count
http.server.duration The duration of the inbound HTTP request
Type: float
Unit: time,ns
http.server.request.duration The count of HTTP request duration time in each bucket
Type: float
Unit: count
http.server.requests The http request count
Type: float
Unit: count
http.server.requests.max None
Type: float
Unit: digital,B
http.server.response.size The size of HTTP response messages
Type: float
Unit: digital,B
http.server.tomcat.errorCount The number of errors per second on all request processors
Type: float
Unit: count
http.server.tomcat.maxTime The longest request processing time
Type: float
Unit: timeStamp,msec
http.server.tomcat.processingTime Represents the total time for processing all requests
Type: float
Unit: timeStamp,msec
http.server.tomcat.requestCount The number of requests per second across all request processors
Type: float
Unit: count
http.server.tomcat.sessions.activeSessions The number of active sessions
Type: float
Unit: count
http.server.tomcat.threads Thread Count of the Thread Pool
Type: float
Unit: count
http.server.tomcat.traffic The number of bytes transmitted
Type: float
Unit: traffic,B/S
jvm.buffer.count An estimate of the number of buffers in the pool
Type: float
Unit: count
jvm.buffer.memory.used An estimate of the memory that the Java virtual machine is using for this buffer pool
Type: float
Unit: digital,B
jvm.buffer.total.capacity An estimate of the total capacity of the buffers in this pool
Type: float
Unit: digital,B
jvm.classes.loaded The number of classes that are currently loaded in the Java virtual machine
Type: float
Unit: count
jvm.classes.unloaded The total number of classes unloaded since the Java virtual machine has started execution
Type: float
Unit: count
jvm.gc.live.data.size Size of long-lived heap memory pool after reclamation
Type: float
Unit: digital,B
jvm.gc.max.data.size Max size of long-lived heap memory pool
Type: float
Unit: digital,B
jvm.gc.memory.allocated Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next
Type: float
Unit: digital,B
jvm.gc.memory.promoted Count of positive increases in the size of the old generation memory pool before GC to after GC
Type: float
Unit: digital,B
jvm.gc.overhead An approximation of the percent of CPU time used by GC activities over the last look back period or since monitoring began, whichever is shorter, in the range [0..1]
Type: int
Unit: count
jvm.gc.pause Time spent in GC pause
Type: float
Unit: timeStamp,nsec
jvm.gc.pause.max Time spent in GC pause
Type: float
Unit: timeStamp,msec
jvm.memory.committed The amount of memory in bytes that is committed for the Java virtual machine to use
Type: float
Unit: digital,B
jvm.memory.max The maximum amount of memory in bytes that can be used for memory management
Type: float
Unit: digital,B
jvm.memory.usage.after.gc The percentage of long-lived heap pool used after the last GC event, in the range [0..1]
Type: float
Unit: percent,percent
jvm.memory.used The amount of used memory
Type: float
Unit: digital,B
jvm.threads.daemon The current number of live daemon threads
Type: float
Unit: count
jvm.threads.live The current number of live threads including both daemon and non-daemon threads
Type: float
Unit: digital,B
jvm.threads.peak The peak live thread count since the Java virtual machine started or peak was reset
Type: float
Unit: digital,B
jvm.threads.states The current number of threads having NEW state
Type: float
Unit: digital,B
kafka.controller.active.count The number of controllers active on the broker
Type: float
Unit: count
kafka.isr.operation.count The number of in-sync replica shrink and expand operations
Type: float
Unit: count
kafka.lag.max The max lag in messages between follower and leader replicas
Type: float
Unit: timeStamp,msec
kafka.leaderElection.count The leader election count
Type: float
Unit: count
kafka.leaderElection.unclean.count Unclean leader election count - increasing indicates broker failures
Type: float
Unit: count
kafka.message.count The number of messages received by the broker
Type: float
Unit: count
kafka.network.io The bytes received or sent by the broker
Type: float
Unit: digital,B
kafka.partition.count The number of partitions on the broker
Type: float
Unit: count
kafka.partition.offline The number of partitions offline
Type: float
Unit: count
kafka.partition.underReplicated The number of under replicated partitions
Type: float
Unit: count
kafka.purgatory.size The number of requests waiting in purgatory
Type: float
Unit: count
kafka.request.count The number of requests received by the broker
Type: float
Unit: count
kafka.request.failed The number of requests to the broker resulting in a failure
Type: float
Unit: count
kafka.request.queue Size of the request queue
Type: float
Unit: count
kafka.request.time.50p The 50th percentile time the broker has taken to service requests
Type: float
Unit: timeStamp,msec
kafka.request.time.99p The 99th percentile time the broker has taken to service requests
Type: float
Unit: timeStamp,msec
kafka.request.time.total The total time the broker has taken to service requests
Type: float
Unit: timeStamp,msec
log4j2.events Number of fatal level log events
Type: float
Unit: count
otlp.exporter.exported OTLP exporter to remote
Type: int
Unit: count
otlp.exporter.seen OTLP exporter
Type: int
Unit: count
process.cpu.usage The "recent cpu usage" for the Java Virtual Machine process
Type: float
Unit: percent,percent
process.files.max The maximum file descriptor count
Type: float
Unit: count
process.files.open The open file descriptor count
Type: float
Unit: digital,B
process.runtime.jvm.buffer.count The number of buffers in the pool
Type: float
Unit: count
process.runtime.jvm.buffer.limit Total capacity of the buffers in this pool
Type: float
Unit: digital,B
process.runtime.jvm.buffer.usage Memory that the Java virtual machine is using for this buffer pool
Type: float
Unit: digital,B
process.runtime.jvm.classes.current_loaded Number of classes currently loaded
Type: float
Unit: count
process.runtime.jvm.classes.loaded Number of classes loaded since JVM start
Type: int
Unit: count
process.runtime.jvm.classes.unloaded Number of classes unloaded since JVM start
Type: float
Unit: count
process.runtime.jvm.cpu.utilization Recent cpu utilization for the process
Type: float
Unit: digital,B
process.runtime.jvm.gc.duration Duration of JVM garbage collection actions
Type: float
Unit: timeStamp,nsec
process.runtime.jvm.memory.committed Measure of memory committed
Type: float
Unit: digital,B
process.runtime.jvm.memory.init Measure of initial memory requested
Type: float
Unit: digital,B
process.runtime.jvm.memory.limit Measure of max obtainable memory
Type: float
Unit: digital,B
process.runtime.jvm.memory.usage Measure of memory used
Type: float
Unit: digital,B
process.runtime.jvm.memory.usage_after_last_gc Measure of memory used after the most recent garbage collection event on this pool
Type: float
Unit: digital,B
process.runtime.jvm.system.cpu.load_1m Average CPU load of the whole system for the last minute
Type: float
Unit: percent,percent
process.runtime.jvm.system.cpu.utilization Recent cpu utilization for the whole system
Type: float
Unit: percent,percent
process.runtime.jvm.threads.count Number of executing threads
Type: float
Unit: count
process.start.time Start time of the process since unix epoch
Type: float
Unit: digital,B
process.uptime The uptime of the Java virtual machine
Type: int
Unit: timeStamp,sec
processedSpans The number of spans processed by the BatchSpanProcessor
Type: int
Unit: count
queueSize The number of spans queued
Type: int
Unit: count
system.cpu.count The number of processors available to the Java virtual machine
Type: int
Unit: count
system.cpu.usage The "recent cpu usage" for the whole system
Type: float
Unit: percent,percent
system.load.average.1m The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time
Type: float
Unit: count

opentelemetry

This is the field description for the trace.

  • Tags
Tag Description
base_service Span Base service name
container_host Container hostname. Available in OpenTelemetry. Optional.
db_host DB host name: ip or domain name. Optional.
db_name Database name. Optional.
db_system Database system name:mysql,oracle... Optional.
dk_fingerprint DataKit fingerprint is DataKit hostname
endpoint Endpoint info. Available in SkyWalking, Zipkin. Optional.
env Application environment info. Available in Jaeger. Optional.
host Hostname.
http_method HTTP request method name. Available in DDTrace, OpenTelemetry. Optional.
http_route HTTP route. Optional.
http_status_code HTTP response code. Available in DDTrace, OpenTelemetry. Optional.
http_url HTTP URL. Optional.
operation Span name
out_host This is the database host, equivalent to db_host,only DDTrace-go. Optional.
project Project name. Available in Jaeger. Optional.
service Service name. Optional.
source_type Tracing source type
span_type Span type
status Span status
version Application version info. Available in Jaeger. Optional.
  • Metrics
Metric Description
duration Duration of span
Type: int
Unit: time,μs
message Origin content of span
Type: string
Unit: N/A
parent_id Parent span ID of current span
Type: string
Unit: N/A
resource Resource name produce current span
Type: string
Unit: N/A
span_id Span id
Type: string
Unit: N/A
start start time of span.
Type: int
Unit: timeStamp,usec
trace_id Trace id
Type: string
Unit: N/A

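Tags removed from metrics

Metrics reported by OTEL carry many tags that are of little use. They are all of String type and consume too much memory and bandwidth, so they are removed and not uploaded to GuanCe.

These tags include: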

process.command_line
process.executable.path
process.runtime.description
process.runtime.name
process.runtime.version
telemetry.distro.name
telemetry.distro.version
telemetry.sdk.language
telemetry.sdk.name
telemetry.sdk.version

Logs

Version-1.33.0

Currently the Java Agent supports collecting stdout logs and sends them to DataKit over the OTLP protocol using the Standard output method.

With Agent V1, log collection is off by default and must be enabled explicitly with otel.logs.exporter, as follows:

# env
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://<DataKit Addr>:4317
# other env
java -jar app.jar

# command
java -javaagent:/path/to/agent.jar \
  -Dotel.logs.exporter=otlp \
  -Dotel.exporter.otlp.endpoint=http://<DataKit Addr>:4317 \
  -jar app.jar

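By default a log body is at most 500KB long; anything beyond that is split into multiple logs. A log tag is at most 32KB long; this limit is not configurable, and anything beyond it is truncated.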
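The splitting rule can be pictured with a short Python sketch. It assumes that 1KB means 1024 bytes and that the body is cut at plain byte boundaries; both are assumptions, as is the function name split_log, and DataKit's real splitting may differ.

```python
def split_log(message: bytes, log_max_kb: int = 500) -> list:
    """Split a log body into chunks of at most log_max_kb kilobytes (assumed 1 KB = 1024 B)."""
    limit = log_max_kb * 1024
    return [message[i:i + limit] for i in range(0, len(message), limit)]

# A 1200 KB body becomes three logs: 500 KB, 500 KB and 200 KB.
chunks = split_log(b"x" * (1200 * 1024))
print([len(c) // 1024 for c in chunks])  # [500, 500, 200]
```

The default of 500 mirrors log_max = 500 in the sample configuration.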

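Logs collected through OTEL use the service name as their source. The source can also be customized by adding the log.source tag, for example: -Dotel.resource.attributes="log.source=source_name"

Note: if the app runs in a container environment (such as k8s), DataKit already collects its logs automatically (the default behavior), so collecting them again through OTEL produces duplicates. Before enabling OTEL log collection, manually turn off DataKit's own log collection.

For other languages, see the official documentation.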

Examples

DataKit currently provides best practices for two languages: Java and Go.

More documentation
