OpenTelemetry
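OpenTelemetry (hereafter OTEL) is a CNCF observability project that aims to provide standardized solutions for the observability field, standardizing the data model, collection, processing, and export of telemetry data.
OTEL is a collection of standards and tools for managing telemetry data such as traces, metrics, and logs. This document describes how to configure and enable OTEL data ingestion on DataKit, along with best practices for Java and Go.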
Configuration¶
Go to the conf.d/samples directory under the DataKit installation directory, copy opentelemetry.conf.sample, and rename the copy to opentelemetry.conf. An example follows:
[[inputs.opentelemetry]]
## customer_tags works as a whitelist that limits which tags are sent to the data center.
## Every "." in a tag name is replaced with "_". For example,
## "project.name" is sent to the center as "project_name".
# customer_tags = ["sink_project", "custom.otel.tag"]
## If set to true, all Attributes will be extracted and message.Attributes will be empty.
# customer_tags_all = false
## Keep rare tracing resources list switch.
## If some resources are rare enough (not present within 1 hour), those resources are always sent
## to the data center regardless of samplers and filters.
# keep_rare_resource = false
## By default, every error present in a span is sent to the data center, bypassing any filters or
## samplers. To exclude certain error statuses, list them here.
# omit_err_status = ["404"]
## compatible ddtrace: make OTEL traces compatible with DDTrace traces.
# compatible_ddtrace=false
## Split service.name from xx.system.
## see: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/database/database-spans.md
split_service_name = true
## delete trace message
# del_message = true
## Max length of logging message data, default is 500 KB.
log_max = 500
## JSON marshaler: set the JSON marshaler. Available marshalers are:
## gojson/jsoniter/protojson
##
## gojson and jsoniter perform better than protojson,
## but for compatibility reasons protojson is still the default.
jmarshaler = "protojson"
## Clean the top-level fields in message. Default is true.
clean_message = true
## tracing_metric_enable: trace_hits trace_hits_by_http_status trace_latency trace_errors trace_errors_by_http_status trace_apdex.
## Extract the above metrics from the collected traces.
# tracing_metric_enable = true
## Blacklist of metric tags: There are many labels in the metric: "tracing_metrics".
## If you want to remove certain tags, you can use the blacklist to remove them.
## By default, it includes: source,span_name,env,service,status,version,resource,http_status_code,http_status_class
## and "customer_tags", k8s related tags, and others service.
# tracing_metric_tag_blacklist = ["resource", "operation", "tag_a", "tag_b"]
## Ignore tracing resources map like service:[resources...].
## The service name is the full service name in current application.
## The resource list contains regular expressions used to block resource names.
## If you want to block some resources universally under all services, you can set the
## service name as "*". Note: double quotes "" cannot be omitted.
# [inputs.opentelemetry.close_resource]
# service1 = ["resource1", "resource2", ...]
# service2 = ["resource1", "resource2", ...]
# "*" = ["close_resource_under_all_services"]
# ...
## Sampler config sets the global sampling strategy.
## sampling_rate sets the global sampling rate.
# [inputs.opentelemetry.sampler]
# sampling_rate = 1.0
# [inputs.opentelemetry.tags]
# key1 = "value1"
# key2 = "value2"
# ...
## Threads config controls how many goroutines the agent can start to handle HTTP requests.
## buffer is the size of the worker channel's job buffer.
## threads is the total number of goroutines at run time.
# [inputs.opentelemetry.threads]
# buffer = 100
# threads = 8
## Storage config sets a local storage space on the hard drive to cache trace data.
## path is the local file path used to cache data.
## capacity is the total space size (MB) used to store data.
# [inputs.opentelemetry.storage]
# path = "./otel_storage"
# capacity = 5120
## OTEL agent HTTP config for trace and metrics
## If enabled, traces and metrics are received on their respective paths, which default to:
## trace : /otel/v1/traces
## metric: /otel/v1/metrics
## and the client side should be configured properly with Datakit listening port(default: 9529)
## or custom HTTP request path.
## for example http://127.0.0.1:9529/otel/v1/traces
## The acceptable http_status_ok values will be 200 or 202.
[inputs.opentelemetry.http]
http_status_ok = 200
trace_api = "/otel/v1/traces"
metric_api = "/otel/v1/metrics"
logs_api = "/otel/v1/logs"
## OTEL agent gRPC config for trace and metrics.
## gRPC services for trace and metrics can each be enabled by setting the corresponding option to true.
## addr is the listening address of the gRPC server.
[inputs.opentelemetry.grpc]
addr = "127.0.0.1:4317"
max_payload = 16777216 # default 16MiB
## If 'expected_headers' is configured, the client must send the listed HTTP headers;
## otherwise HTTP status code 400 (bad request) is returned.
## Note: once set, expected_headers applies to both trace and metrics.
# [inputs.opentelemetry.expected_headers]
# ex_version = "1.2.3"
# ex_name = "env_resource_name"
# ...
After configuration, restart DataKit.
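The customer_tags comments in the sample configuration say that every "." in a whitelisted tag name is replaced with "_" before the tag is sent. The renaming can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name extract_customer_tags and the plain-dict representation of span attributes are hypothetical and do not reflect DataKit's actual implementation.

```python
def extract_customer_tags(attributes, customer_tags):
    """Return whitelisted attributes with '.' replaced by '_' in each key.

    attributes:    span attributes as a plain dict (an assumption for this sketch)
    customer_tags: the whitelist, as it would appear in opentelemetry.conf
    """
    tags = {}
    for key in customer_tags:
        if key in attributes:
            # "project.name" becomes "project_name"
            tags[key.replace(".", "_")] = str(attributes[key])
    return tags


attrs = {"project.name": "shop", "custom.otel.tag": "a", "other": "x"}
print(extract_customer_tags(attrs, ["project.name", "custom.otel.tag"]))
# {'project_name': 'shop', 'custom_otel_tag': 'a'}
```

Attributes that are not on the whitelist ("other" in this example) are left out of the extracted tags.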
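The collector can also be enabled by injecting its configuration via ConfigMap or by setting ENV_DATAKIT_INPUTS.
Configuration parameters can also be changed through environment variables (the collector must be added to ENV_DEFAULT_ENABLED_INPUTS as a default collector):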
- ENV_INPUT_OTEL_CUSTOMER_TAGS
  Tag whitelist
  Field type: JSON
  Collector config field: customer_tags
  Example: '["project_id", "custom.tag"]'
- ENV_INPUT_OTEL_CUSTOMER_TAGS_ALL
  Extract all tags
  Field type: Boolean
  Collector config field: customer_tags_all
  Default: false
- ENV_INPUT_OTEL_KEEP_RARE_RESOURCE
  Keep the rare tracing resources list
  Field type: Boolean
  Collector config field: keep_rare_resource
  Default: false
- ENV_INPUT_OTEL_COMPATIBLE_DD_TRACE
  Convert trace_id to decimal for DDTrace compatibility
  Field type: Boolean
  Collector config field: compatible_dd_trace
  Default: false
- ENV_INPUT_OTEL_SPLIT_SERVICE_NAME
  Take xx.system from span.Attributes to replace the service name
  Field type: Boolean
  Collector config field: split_service_name
  Default: false
- ENV_INPUT_OTEL_TRACING_METRIC_ENABLE
  Enable collection of request count, error count, and latency metrics
  Field type: Boolean
  Collector config field: tracing_metric_enable
  Default: false
- ENV_INPUT_OTEL_TRACING_METRIC_TAG_BLACKLIST
  Blacklist of tags in the tracing_metrics metric set
  Field type: JSON
  Collector config field: tracing_metric_tag_blacklist
  Example: '["tag_a", "tag_b"]'
- ENV_INPUT_OTEL_DEL_MESSAGE
  Delete trace messages
  Field type: Boolean
  Collector config field: del_message
  Default: false
- ENV_INPUT_OTEL_OMIT_ERR_STATUS
  Error-status whitelist
  Field type: JSON
  Collector config field: omit_err_status
  Example: '["404", "403", "400"]'
- ENV_INPUT_OTEL_CLOSE_RESOURCE
  Ignore tracing for the specified services (regex match)
  Field type: JSON
  Collector config field: close_resource
  Example: '{"service1":["resource1","other"],"service2":["resource2","other"]}'
- ENV_INPUT_OTEL_SAMPLER
  Global sampling rate
  Field type: Float
  Collector config field: sampler
  Example: 0.3
- ENV_INPUT_OTEL_THREADS
  Number of threads and buffer size
  Field type: JSON
  Collector config field: threads
  Example: '{"buffer":1000, "threads":100}'
- ENV_INPUT_OTEL_STORAGE
  Local cache path and size (MB)
  Field type: JSON
  Collector config field: storage
  Example: '{"storage":"./otel_storage", "capacity": 5120}'
- ENV_INPUT_OTEL_HTTP
  Agent HTTP configuration
  Field type: JSON
  Collector config field: http
  Example: '{"enable":true, "http_status_ok": 200, "trace_api": "/otel/v1/traces", "metric_api": "/otel/v1/metrics"}'
- ENV_INPUT_OTEL_GRPC
  Agent gRPC configuration
  Field type: JSON
  Collector config field: grpc
  Example: '{"addr": "127.0.0.1:4317", "max_payload": 16777216 }'
- ENV_INPUT_OTEL_EXPECTED_HEADERS
  HTTP headers expected from the client
  Field type: JSON
  Collector config field: expected_headers
  Example: '{"ex_version": "1.2.3", "ex_name": "env_resource_name"}'
- ENV_INPUT_OTEL_CLEAN_MESSAGE
  Trim the size of the message field
  Field type: Boolean
  Collector config field: clean_message
  Example: true/false
- ENV_INPUT_OTEL_TAGS
  Custom tags. If the configuration file defines a tag with the same name, it is overridden
  Field type: JSON
  Collector config field: tags
  Example: '{"k1":"v1", "k2":"v2", "k3":"v3"}'
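Because the JSON-typed variables must hold valid JSON strings, it can help to generate them programmatically rather than hand-quoting them. The sketch below is one possible approach in Python; the values are taken from the examples in the list, and the helper itself is an illustration, not a DataKit tool.

```python
import json

# Values mirror the examples listed for each variable.
env = {
    "ENV_INPUT_OTEL_CUSTOMER_TAGS": json.dumps(["project_id", "custom.tag"]),
    "ENV_INPUT_OTEL_THREADS": json.dumps({"buffer": 1000, "threads": 100}),
    "ENV_INPUT_OTEL_GRPC": json.dumps(
        {"addr": "127.0.0.1:4317", "max_payload": 16777216}
    ),
}

for name, value in env.items():
    json.loads(value)  # round-trip check: raises ValueError on invalid JSON
    print(f"{name}='{value}'")
```

Each printed line can be pasted into a shell or a container environment definition; the single quotes keep the inner double quotes intact.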
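Notes¶
- gRPC is recommended: it offers a higher compression ratio, faster serialization, and better efficiency.
- Since DataKit 1.10.0, the HTTP routes are configurable. The default request paths for traces, logs, and metrics are /otel/v1/traces, /otel/v1/logs, and /otel/v1/metrics respectively.
- Values of type float/double keep at most two decimal places.
- Both HTTP and gRPC support gzip compression. It is off by default and can be enabled in the exporter with the environment variable OTEL_EXPORTER_OTLP_COMPRESSION=gzip.
- HTTP requests support both JSON and Protobuf serialization, whereas gRPC supports only Protobuf.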
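Warning
- In DDTrace trace data, the service name is derived from the service name or the referenced third-party library, whereas the OTEL collector takes the service name from otel.service.name.
- To display these service names separately, a configuration field was added: split_service_name = true
- The service name is then taken from the tags of the trace data. For example, with the DB tag db.system=mysql the service name is mysql; with a message-queue tag such as messaging.system=kafka the service name is kafka.
- By default it is taken from these three tags: db.system/rpc.system/messaging.system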
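The service-name rule described in the warning can be sketched as a small lookup. This is a simplified Python illustration, not DataKit's code: the function name resolve_service_name is hypothetical, and the order in which the three tags are checked is an assumption.

```python
# The three attributes named in the text; the lookup order is an assumption.
SYSTEM_KEYS = ("db.system", "rpc.system", "messaging.system")


def resolve_service_name(service_name, span_attributes, split_service_name=True):
    """Pick the service name for a span.

    With split_service_name enabled, the first xx.system attribute found
    replaces the service name; otherwise the original name is kept.
    """
    if split_service_name:
        for key in SYSTEM_KEYS:
            value = span_attributes.get(key)
            if value:
                return value
    return service_name


print(resolve_service_name("app", {"db.system": "mysql"}))         # mysql
print(resolve_service_name("app", {"messaging.system": "kafka"}))  # kafka
print(resolve_service_name("app", {"http.method": "GET"}))         # app
```

Spans without any of the three attributes keep the service name reported by the application.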
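When using the OTEL HTTP exporter, pay attention to the environment variable configuration. Because DataKit's default paths are /otel/v1/traces, /otel/v1/logs, and /otel/v1/metrics, the trace and metric endpoints must be configured separately to use HTTP.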
Agent V2¶
By default, V2 switches the otlp exporter from the earlier grpc to http/protobuf. You can set -Dotel.exporter.otlp.protocol=grpc to use gRPC, or keep the default http/protobuf.
With HTTP, each exporter path must be configured explicitly, for example:
java -javaagent:/usr/local/ddtrace/opentelemetry-javaagent-2.5.0.jar \
-Dotel.exporter=otlp \
-Dotel.exporter.otlp.protocol=http/protobuf \
-Dotel.exporter.otlp.logs.endpoint=http://localhost:9529/otel/v1/logs \
-Dotel.exporter.otlp.traces.endpoint=http://localhost:9529/otel/v1/traces \
-Dotel.exporter.otlp.metrics.endpoint=http://localhost:9529/otel/v1/metrics \
-Dotel.service.name=app \
-jar app.jar
gRPC must be configured explicitly; otherwise the default HTTP protocol is used:
java -javaagent:/usr/local/ddtrace/opentelemetry-javaagent-2.5.0.jar \
-Dotel.exporter=otlp \
-Dotel.exporter.otlp.protocol=grpc \
-Dotel.exporter.otlp.endpoint=http://localhost:4317 \
-Dotel.service.name=app \
-jar app.jar
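Log export is enabled by default. To disable log collection, set the logs exporter to none: -Dotel.logs.exporter=none
For more on the breaking changes in V2, see the official documentation or the GitHub release notes: Github-v2.0.0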
Common Commands¶
The following settings are commonly used when starting an application:
ENV (corresponding command) | Description |
---|---|
OTEL_SDK_DISABLED(otel.sdk.disabled) | Disable the SDK; default false. When disabled, no trace or metric data is produced |
OTEL_RESOURCE_ATTRIBUTES(otel.resource.attributes) | Add global custom tags that are attached to every span. Example: service.name=App,project=app-a |
OTEL_SERVICE_NAME(otel.service.name) | Set the service name; takes precedence over the custom tags |
OTEL_LOG_LEVEL(otel.log.level) | Log level, default info |
OTEL_PROPAGATORS(otel.propagators) | Set the propagation protocols, default tracecontext,baggage |
OTEL_TRACES_SAMPLER(otel.traces.sampler) | Set the sampler type |
OTEL_TRACES_SAMPLER_ARG(otel.traces.sampler.arg) | Argument for the sampler above, range 0~1.0, default 1.0 |
OTEL_EXPORTER_OTLP_PROTOCOL(otel.exporter.otlp.protocol) | Set the transport protocol, default grpc; options: grpc, http/protobuf, http/json |
OTEL_EXPORTER_OTLP_ENDPOINT(otel.exporter.otlp.endpoint) | Set the trace upload address; it must be the DataKit address http://datakit-endpoint:9529/otel/v1/traces |
OTEL_TRACES_EXPORTER(otel.traces.exporter) | Trace exporter, default otlp |
OTEL_LOGS_EXPORTER(otel.logs.exporter) | Log exporter, default otlp. Note: OTEL V1 requires explicit configuration; otherwise it is disabled by default |
Pass otel.javaagent.debug=true to the Agent to view debug logs. These logs are quite verbose, so use them with caution in production.
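Trace Sampling¶
Either head-based or tail-based sampling can be used; see these two best-practice articles:
- Tail-based sampling, which requires a collector: OpenTelemetry Sampling Best Practices
- Head-based sampling on the Agent side: OpenTelemetry Java Agent Sampling Strategies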
Tag Extraction¶
Starting with DataKit 1.22.0, the blacklist feature is deprecated. A fixed tag list was added instead: only attributes in this list are extracted as top-level tags. The fixed list is:
Attributes | Tags | Description |
---|---|---|
http.url | http_url | Full HTTP request path |
http.hostname | http_hostname | hostname |
http.route | http_route | Route |
http.status_code | http_status_code | Status code |
http.request.method | http_request_method | Request method |
http.method | http_method | Same as above |
http.client_ip | http_client_ip | Client IP |
http.scheme | http_scheme | Request scheme |
url.full | url_full | Full request URL |
url.scheme | url_scheme | Request scheme |
url.path | url_path | Request path |
url.query | url_query | Request parameters |
span_kind | span_kind | Span type |
db.system | db_system | Span type |
db.operation | db_operation | DB operation |
db.name | db_name | Database name |
db.statement | db_statement | Detailed information |
server.address | server_address | Server address |
net.host.name | net_host_name | Requested host |
server.port | server_port | Server port |
net.host.port | net_host_port | Same as above |
network.peer.address | network_peer_address | Network address |
network.peer.port | network_peer_port | Network port |
network.transport | network_transport | Protocol |
messaging.system | messaging_system | Message queue name |
messaging.operation | messaging_operation | Message operation |
messaging.message | messaging_message | Message |
messaging.destination | messaging_destination | Message details |
rpc.service | rpc_service | RPC service address |
rpc.system | rpc_system | RPC service name |
error | error | Whether an error occurred |
error.message | error_message | Error message |
error.stack | error_stack | Stack trace |
error.type | error_type | Error type |
error.msg | error_message | Error message |
project | project | project |
version | version | Version |
env | env | Environment |
host | host | host tag in Attributes |
pod_name | pod_name | pod_name tag in Attributes |
pod_namespace | pod_namespace | pod_namespace tag in Attributes |
To add custom tags, use environment variables.
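Span Kind¶
Every span carries a span_kind tag, which has 6 possible values:
- unspecified: not set.
- internal: internal span or child span.
- server: web services, RPC services, and so on.
- client: client type.
- producer: message producer.
- consumer: message consumer.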
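Metric Collection¶
The OpenTelemetry Java Agent obtains MBean metrics from the application over JMX and reports the selected JMX metrics through its internal SDK, which means all of these metrics are configurable.
JMX metric reporting can be turned on or off with otel.jmx.enabled=true/false (it is on by default). To control the interval between MBean discovery attempts, use otel.jmx.discovery.delay, which defines the number of milliseconds that elapse between the first and the next discovery cycle.
The Agent also ships with built-in collection configurations for some third-party software. For details, see: GitHub OTEL JMX Metric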
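Histogram metrics receive special handling:
- OpenTelemetry histogram buckets are mapped directly to Prometheus histogram buckets.
- Each bucket's count is converted to Prometheus's cumulative count format. For example, the OpenTelemetry buckets [0, 10), [10, 50), and [50, 100) become Prometheus _bucket metrics carrying an le label.
- The total number of observations in the OpenTelemetry histogram becomes the Prometheus _count metric.
- The sum of the OpenTelemetry histogram becomes the Prometheus _sum metric; _max and _min are added as well.
Any metric ending in _bucket is histogram data and always has companion metrics ending in _max, _min, _count, and _sum.
In histogram data, the le (less or equal) tag can be used for classification and filtering. See OpenTelemetry Metrics for all metrics and tags.
This conversion lets histogram data collected by OpenTelemetry integrate seamlessly into Prometheus and be analyzed with Prometheus's querying and visualization features.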
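The bucket conversion described above amounts to a running sum over the per-bucket counts. The Python sketch below illustrates the idea under stated assumptions: the function name to_prometheus_histogram, the tuple layout of the output points, and the sample numbers are all hypothetical, and _max/_min are omitted because they come from the raw observations rather than from the buckets.

```python
from itertools import accumulate


def to_prometheus_histogram(name, bounds, counts, total_sum):
    """Convert per-bucket counts into cumulative Prometheus-style points.

    bounds: upper bounds of the finite buckets, e.g. [10, 50, 100]
    counts: one count per finite bucket plus one overflow bucket
    """
    if len(counts) != len(bounds) + 1:
        raise ValueError("counts must have len(bounds) + 1 entries")
    labels = [str(b) for b in bounds] + ["+Inf"]
    cumulative = list(accumulate(counts))  # running total per bucket
    points = [
        (f"{name}_bucket", {"le": le}, c) for le, c in zip(labels, cumulative)
    ]
    points.append((f"{name}_count", {}, cumulative[-1]))
    points.append((f"{name}_sum", {}, total_sum))
    return points


for point in to_prometheus_histogram("latency", [10, 50, 100], [3, 5, 2, 1], 412.5):
    print(point)
```

With per-bucket counts 3, 5, 2, and 1, the cumulative _bucket values are 3, 8, 10, and 11, and _count equals the last cumulative value.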
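Log Collection¶
Currently the Java Agent supports collecting stdout logs and sends them to DataKit over the otlp protocol using the Standard output method.
OTEL Agent V1 does not enable log collection by default; it must be enabled explicitly, as follows: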
# env
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://<DataKit Addr>:4317
java -jar app.jar
# command
java -javaagent:/path/to/agent.jar \
-Dotel.logs.exporter=otlp \
-Dotel.exporter.otlp.endpoint=http://<DataKit Addr>:4317 \
-jar app.jar
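By default, the maximum length of a log's content is 500 KB; anything beyond that is split into multiple log entries. The maximum length of a log tag is 32 KB; this limit is not configurable, and anything beyond it is truncated.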
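The splitting behavior can be pictured as fixed-size chunking of the message. The following Python sketch is only an illustration: the function name split_log is hypothetical, and it assumes a plain byte-based cut, which may differ from how DataKit actually chooses split points (for example around multi-byte characters).

```python
from typing import List


def split_log(message: bytes, max_len: int = 500 * 1024) -> List[bytes]:
    """Split a log message into chunks of at most max_len bytes."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not message:
        return [message]
    return [message[i:i + max_len] for i in range(0, len(message), max_len)]


chunks = split_log(b"x" * (1200 * 1024))
print(len(chunks))                   # 3
print([len(c) // 1024 for c in chunks])  # [500, 500, 200]
```

A 1200 KB message therefore yields three log entries under the default 500 KB limit.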
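The source of logs collected through OTEL is the service name. It can also be customized by adding the log.source tag, for example: -Dotel.resource.attributes="log.source=source_name".
Note: if the app runs in a container environment (such as k8s), DataKit already collects its logs automatically (the default behavior), so collecting them again causes duplicates. Before enabling log collection here, manually turn off DataKit's own log collection.
For other languages, see the official documentation.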
Collected Fields¶
Tracing¶
opentelemetry¶
The collected tracing fields are described below.
Tags & Fields | Description |
---|---|
base_service ( tag ) | Span base service name |
container_host ( tag ) | Container hostname. Available in OpenTelemetry. Optional. |
db_host ( tag ) | DB host name: ip or domain name. Optional. |
db_name ( tag ) | Database name. Optional. |
db_system ( tag ) | Database system name:mysql,oracle... Optional. |
dk_fingerprint ( tag ) | DataKit fingerprint(always DataKit's hostname) |
endpoint ( tag ) | Endpoint info. Available in SkyWalking, Zipkin. Optional. |
env ( tag ) | Application environment info. Available in Jaeger. Optional. |
host ( tag ) | Hostname. |
http_method ( tag ) | HTTP request method name. Available in DDTrace, OpenTelemetry. Optional. |
http_route ( tag ) | HTTP route. Optional. |
http_status_code ( tag ) | HTTP response code. Available in DDTrace, OpenTelemetry. Optional. |
http_url ( tag ) | HTTP URL. Optional. |
operation ( tag ) | Span name |
out_host ( tag ) | This is the database host, equivalent to db_host,only DDTrace-go. Optional. |
project ( tag ) | Project name. Available in Jaeger. Optional. |
service ( tag ) | Service name. Optional. |
source_type ( tag ) | Tracing source type |
span_type ( tag ) | Span type |
status ( tag ) | Span status |
version ( tag ) | Application version info. Available in Jaeger. Optional. |
duration | Duration of span Type: int Unit: time,μs |
message | Origin content of span Type: string Unit: N/A |
parent_id | Parent span ID of current span Type: string Unit: N/A |
resource | Resource name produce current span Type: string Unit: N/A |
span_id | Span id Type: string Unit: N/A |
start | start time of span. Type: int Unit: timeStamp,usec |
trace_id | Trace id Type: string Unit: N/A |
Metrics¶
otel_service¶
Tags & Fields | Description |
---|---|
action ( tag ) | GC Action |
area ( tag ) | Heap or not |
cause ( tag ) | GC Cause |
container_id ( tag ) | Container ID |
db_host ( tag ) | DB host name: ip or domain name |
db_name ( tag ) | Database name |
db_system ( tag ) | Database system name:mysql,oracle... |
direction ( tag ) | received or sent |
exception ( tag ) | Exception Information |
gc ( tag ) | GC Type |
host ( tag ) | Host Name |
host_arch ( tag ) | Host arch |
host_name ( tag ) | Host Name |
http.scheme ( tag ) | HTTP/HTTPS |
http_method ( tag ) | HTTP Method |
http_request_method ( tag ) | HTTP Method |
http_response_status_code ( tag ) | HTTP status code |
http_route ( tag ) | HTTP Route |
id ( tag ) | JVM Type |
instrumentation_name ( tag ) | Metric Name |
jvm_gc_action ( tag ) | action:end of major,end of minor GC |
jvm_gc_name ( tag ) | name:PS MarkSweep,PS Scavenge |
jvm_memory_pool_name ( tag ) | pool_name:code cache,PS Eden Space,PS Old Gen,MetaSpace... |
jvm_memory_type ( tag ) | memory type:heap,non_heap |
jvm_thread_state ( tag ) | Thread state:runnable,timed_waiting,waiting |
le ( tag ) | *_bucket: histogram metric explicit bounds |
level ( tag ) | Log Level |
main-application-class ( tag ) | Main Entry Point |
method ( tag ) | HTTP Type |
name ( tag ) | Thread Pool Name |
net_protocol_name ( tag ) | Net Protocol Name |
net_protocol_version ( tag ) | Net Protocol Version |
os_type ( tag ) | OS Type |
outcome ( tag ) | HTTP Outcome |
path ( tag ) | Disk Path |
pool ( tag ) | JVM Pool Type |
scope_name ( tag ) | Scope name |
service_name ( tag ) | Service Name |
spanProcessorType ( tag ) | Span Processor Type |
state ( tag ) | Thread State:idle,used |
status ( tag ) | HTTP Status Code |
type ( tag ) | Kafka broker type |
unit ( tag ) | metrics unit |
uri ( tag ) | HTTP Request URI |
application.ready.time | Time taken (ms) for the application to be ready to service requests Type: float Unit: timeStamp,msec |
application.started.time | Time taken (ms) to start the application Type: float Unit: timeStamp,msec |
disk.free | Usable space for path Type: float Unit: digital,B |
disk.total | Total space for path Type: float Unit: digital,B |
executor.active | The approximate number of threads that are actively executing tasks Type: float Unit: count |
executor.completed | The approximate total number of tasks that have completed execution Type: float Unit: count |
executor.pool.core | The core number of threads for the pool Type: float Unit: digital,B |
executor.pool.max | The maximum allowed number of threads in the pool Type: float Unit: count |
executor.pool.size | The current number of threads in the pool Type: float Unit: digital,B |
executor.queue.remaining | The number of additional elements that this queue can ideally accept without blocking Type: float Unit: count |
executor.queued | The approximate number of tasks that are queued for execution Type: float Unit: count |
http.server.active_requests | The number of concurrent HTTP requests that are currently in-flight Type: float Unit: count |
http.server.duration | The duration of the inbound HTTP request Type: float Unit: time,ns |
http.server.request.duration | The count of HTTP request duration time in each bucket Type: float Unit: count |
http.server.requests | The http request count Type: float Unit: count |
http.server.requests.max | None Type: float Unit: digital,B |
http.server.response.size | The size of HTTP response messages Type: float Unit: digital,B |
http.server.tomcat.errorCount | The number of errors per second on all request processors Type: float Unit: count |
http.server.tomcat.maxTime | The longest request processing time Type: float Unit: timeStamp,msec |
http.server.tomcat.processingTime | Represents the total time for processing all requests Type: float Unit: timeStamp,msec |
http.server.tomcat.requestCount | The number of requests per second across all request processors Type: float Unit: count |
http.server.tomcat.sessions.activeSessions | The number of active sessions Type: float Unit: count |
http.server.tomcat.threads | Thread Count of the Thread Pool Type: float Unit: count |
http.server.tomcat.traffic | The number of bytes transmitted Type: float Unit: traffic,B/S |
jvm.buffer.count | An estimate of the number of buffers in the pool Type: float Unit: count |
jvm.buffer.memory.used | An estimate of the memory that the Java virtual machine is using for this buffer pool Type: float Unit: digital,B |
jvm.buffer.total.capacity | An estimate of the total capacity of the buffers in this pool Type: float Unit: digital,B |
jvm.classes.loaded | The number of classes that are currently loaded in the Java virtual machine Type: float Unit: count |
jvm.classes.unloaded | The total number of classes unloaded since the Java virtual machine has started execution Type: float Unit: count |
jvm.gc.live.data.size | Size of long-lived heap memory pool after reclamation Type: float Unit: digital,B |
jvm.gc.max.data.size | Max size of long-lived heap memory pool Type: float Unit: digital,B |
jvm.gc.memory.allocated | Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next Type: float Unit: digital,B |
jvm.gc.memory.promoted | Count of positive increases in the size of the old generation memory pool before GC to after GC Type: float Unit: digital,B |
jvm.gc.overhead | An approximation of the percent of CPU time used by GC activities over the last look back period or since monitoring began, whichever is shorter, in the range [0..1] Type: int Unit: count |
jvm.gc.pause | Time spent in GC pause Type: float Unit: timeStamp,nsec |
jvm.gc.pause.max | Time spent in GC pause Type: float Unit: timeStamp,msec |
jvm.memory.committed | The amount of memory in bytes that is committed for the Java virtual machine to use Type: float Unit: digital,B |
jvm.memory.max | The maximum amount of memory in bytes that can be used for memory management Type: float Unit: digital,B |
jvm.memory.usage.after.gc | The percentage of long-lived heap pool used after the last GC event, in the range [0..1] Type: float Unit: percent,percent |
jvm.memory.used | The amount of used memory Type: float Unit: digital,B |
jvm.threads.daemon | The current number of live daemon threads Type: float Unit: count |
jvm.threads.live | The current number of live threads including both daemon and non-daemon threads Type: float Unit: digital,B |
jvm.threads.peak | The peak live thread count since the Java virtual machine started or peak was reset Type: float Unit: digital,B |
jvm.threads.states | The current number of threads having NEW state Type: float Unit: digital,B |
kafka.controller.active.count | The number of controllers active on the broker Type: float Unit: count |
kafka.isr.operation.count | The number of in-sync replica shrink and expand operations Type: float Unit: count |
kafka.lag.max | The max lag in messages between follower and leader replicas Type: float Unit: timeStamp,msec |
kafka.leaderElection.count | The leader election count Type: float Unit: count |
kafka.leaderElection.unclean.count | Unclean leader election count - increasing indicates broker failures Type: float Unit: count |
kafka.message.count | The number of messages received by the broker Type: float Unit: count |
kafka.network.io | The bytes received or sent by the broker Type: float Unit: digital,B |
kafka.partition.count | The number of partitions on the broker Type: float Unit: count |
kafka.partition.offline | The number of partitions offline Type: float Unit: count |
kafka.partition.underReplicated | The number of under replicated partitions Type: float Unit: count |
kafka.purgatory.size | The number of requests waiting in purgatory Type: float Unit: count |
kafka.request.count | The number of requests received by the broker Type: float Unit: count |
kafka.request.failed | The number of requests to the broker resulting in a failure Type: float Unit: count |
kafka.request.queue | Size of the request queue Type: float Unit: count |
kafka.request.time.50p | The 50th percentile time the broker has taken to service requests Type: float Unit: timeStamp,msec |
kafka.request.time.99p | The 99th percentile time the broker has taken to service requests Type: float Unit: timeStamp,msec |
kafka.request.time.total | The total time the broker has taken to service requests Type: float Unit: timeStamp,msec |
log4j2.events | Number of fatal level log events Type: float Unit: count |
otlp.exporter.exported | OTLP exporter to remote Type: int Unit: count |
otlp.exporter.seen | OTLP exporter Type: int Unit: count |
process.cpu.usage | The "recent cpu usage" for the Java Virtual Machine process Type: float Unit: percent,percent |
process.files.max | The maximum file descriptor count Type: float Unit: count |
process.files.open | The open file descriptor count Type: float Unit: digital,B |
process.runtime.jvm.buffer.count | The number of buffers in the pool Type: float Unit: count |
process.runtime.jvm.buffer.limit | Total capacity of the buffers in this pool Type: float Unit: digital,B |
process.runtime.jvm.buffer.usage | Memory that the Java virtual machine is using for this buffer pool Type: float Unit: digital,B |
process.runtime.jvm.classes.current_loaded | Number of classes currently loaded Type: float Unit: count |
process.runtime.jvm.classes.loaded | Number of classes loaded since JVM start Type: int Unit: count |
process.runtime.jvm.classes.unloaded | Number of classes unloaded since JVM start Type: float Unit: count |
process.runtime.jvm.cpu.utilization | Recent cpu utilization for the process Type: float Unit: digital,B |
process.runtime.jvm.gc.duration | Duration of JVM garbage collection actions Type: float Unit: timeStamp,nsec |
process.runtime.jvm.memory.committed | Measure of memory committed Type: float Unit: digital,B |
process.runtime.jvm.memory.init | Measure of initial memory requested Type: float Unit: digital,B |
process.runtime.jvm.memory.limit | Measure of max obtainable memory Type: float Unit: digital,B |
process.runtime.jvm.memory.usage | Measure of memory used Type: float Unit: digital,B |
process.runtime.jvm.memory.usage_after_last_gc | Measure of memory used after the most recent garbage collection event on this pool Type: float Unit: digital,B |
process.runtime.jvm.system.cpu.load_1m | Average CPU load of the whole system for the last minute Type: float Unit: percent,percent |
process.runtime.jvm.system.cpu.utilization | Recent cpu utilization for the whole system Type: float Unit: percent,percent |
process.runtime.jvm.threads.count | Number of executing threads Type: float Unit: count |
process.start.time | Start time of the process since unix epoch Type: float Unit: digital,B |
process.uptime | The uptime of the Java virtual machine Type: int Unit: timeStamp,sec |
processedSpans | The number of spans processed by the BatchSpanProcessor Type: int Unit: count |
queueSize | The number of spans queued Type: int Unit: count |
system.cpu.count | The number of processors available to the Java virtual machine Type: int Unit: count |
system.cpu.usage | The "recent cpu usage" for the whole system Type: float Unit: percent,percent |
system.load.average.1m | The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time Type: float Unit: count |
tracing_metrics¶
Metric data computed from OpenTelemetry traces. It records the number of spans produced, span duration, and related metrics.
Tags & Fields | Description |
---|---|
env ( tag ) | Application environment info(if set in span). |
host ( tag ) | Hostname. |
http_status_class ( tag ) | HTTP response code class, such as 2xx/3xx/4xx/5xx |
http_status_code ( tag ) | HTTP response code |
operation ( tag ) | Span name |
pod_name ( tag ) | Pod name(if set in span). |
pod_namespace ( tag ) | Pod namespace(if set in span). |
project ( tag ) | Project name(if set in span). |
remote_ip ( tag ) | Remote IP. |
resource ( tag ) | Application resource name. |
service ( tag ) | Service name. |
source ( tag ) | Source, always opentelemetry |
status ( tag ) | Span status(ok/error ) |
version ( tag ) | Application version info. |
apdex | Measures the Apdex score for each web service. The currently set satisfaction threshold is 2 seconds. The tags for this metric are fixed: service/env/version/resource/source. The value range is 0~1. Type: float Unit: N/A |
errors | Represent the count of errors for spans. Type: int Unit: count |
errors_by_http_status | Represent the count of errors for a given span group by HTTP status code. Type: int Unit: count |
hits | Count of spans. Type: int Unit: count |
hits_by_http_status | Represent the count of hits for a given span group by HTTP status code. Type: int Unit: count |
latency_bucket | Represent the latency distribution for all services, resources, and versions across different environments and additional primary tags. Recommended for all latency measurement use cases. Use the 'le' tag for filtering Type: int Unit: count |
latency_count | The number of spans is equal to the number of web type spans. Type: int Unit: count |
latency_sum | The total latency of all web spans, corresponding to the 'latency_count' Type: int Unit: time,μs |
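For readers unfamiliar with Apdex, the score in the table can be reproduced with the standard Apdex formula: requests at or below the threshold T count as satisfied, requests between T and 4T count as tolerating (half weight), and the rest count as frustrated. The Python sketch below uses the 2-second threshold mentioned for the apdex field; the 4T tolerating bound comes from the general Apdex definition and is an assumption here, not something confirmed for DataKit's implementation. The function name and sample durations are hypothetical.

```python
def apdex(durations_us, threshold_us=2_000_000):
    """Compute an Apdex score from span durations in microseconds.

    satisfied:  duration <= T
    tolerating: T < duration <= 4T  (standard Apdex definition, assumed here)
    """
    if not durations_us:
        return None  # no spans, no score
    satisfied = sum(1 for d in durations_us if d <= threshold_us)
    tolerating = sum(
        1 for d in durations_us if threshold_us < d <= 4 * threshold_us
    )
    return (satisfied + tolerating / 2) / len(durations_us)


# 1 s and 1.5 s are satisfied, 3 s is tolerating, 9 s is frustrated.
print(apdex([1_000_000, 1_500_000, 3_000_000, 9_000_000]))  # 0.625
```

The result always falls in the range 0 to 1, matching the value range stated for the apdex field.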
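Tags Removed from Metrics¶
In the otel_service metric set, the originally reported metrics carry many useless tags. They are all of String type and are dropped because they consume too much memory and bandwidth. These tags are: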
process.command_line
process.executable.path
process.runtime.description
process.runtime.name
process.runtime.version
telemetry.distro.name
telemetry.distro.version
telemetry.sdk.language
telemetry.sdk.name
telemetry.sdk.version
Examples¶
DataKit currently provides best practices for the following two languages: