Dataway Tail Sampling¶

Overview¶

Dataway provides tail sampling through the following external endpoints:

/v1/tail_sampling
/v1/tail_sampling_v2
/v1/tail_sampling_config

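With tail sampling, Dataway first receives data that has already been packed by group, then applies the sampling rules to decide whether to keep or drop it, and finally writes the kept data to the center.

Three data types are currently supported: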

tracing
logging
rum

The basic processing flow is as follows:

sequenceDiagram
autonumber

participant dk as Datakit/Client
participant dw as Dataway
participant ts as TailSamplingProcessor
participant kodo as Kodo

dk ->> dw: POST /v1/tail_sampling
alt config ready
    dw ->> ts: ingest packet
    ts ->> dw: kept packets
    dw ->> kodo: write tracing/logging/rum
else config not ready
    dw ->> dw: pending cache
    dw -->> dk: 412 Precondition Failed
    dk ->> dw: POST /v1/tail_sampling_config
    dw ->> ts: update config and drain pending
end

Working modes¶

Tail sampling and aggregation share the same mode setting, which takes one of two values:

standalone
proxy

standalone¶

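In standalone mode, the current Dataway processes tail sampling data itself:

It receives a protobuf-encoded aggregate.DataPacket
It looks up the tail sampling configuration by token + data_type
If the configuration is ready, it writes the packet directly into the TailSamplingProcessor
It periodically takes out expired groups and sends them to the write endpoint of the corresponding data type

In the current implementation:

The sampling window advances every 1 second
Derived metrics are flushed every 1 minute
The send stage writes out asynchronously through a worker pool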
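The "take out expired groups on every tick" step can be pictured with a small buffer keyed by group. This is only a sketch under stated assumptions: the class name `GroupBuffer` is hypothetical, and measuring the TTL from the first packet of a group is an assumption, not something this page confirms about TailSamplingProcessor.

```python
from collections import defaultdict


class GroupBuffer:
    """Hypothetical buffer: holds packets per group until the group's TTL elapses."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.first_seen = {}              # group key -> arrival time of its first packet
        self.packets = defaultdict(list)  # group key -> buffered packets

    def ingest(self, group_key: str, packet, now: float) -> None:
        # Assumption: the TTL clock starts when the group is first seen.
        self.first_seen.setdefault(group_key, now)
        self.packets[group_key].append(packet)

    def pop_expired(self, now: float) -> dict:
        """Remove and return every group whose TTL has elapsed at `now`."""
        expired = [k for k, t in self.first_seen.items() if now - t >= self.ttl]
        out = {}
        for key in expired:
            out[key] = self.packets.pop(key)
            del self.first_seen[key]
        return out


buf = GroupBuffer(ttl_seconds=300)
buf.ingest("trace-a", "span-1", now=0)
buf.ingest("trace-a", "span-2", now=10)
buf.ingest("trace-b", "span-3", now=200)

# One tick of the periodic loop: only trace-a has been held for the full TTL.
ready = buf.pop_expired(now=300)
```

In a real processor the returned groups would then go through the sampling rules and be handed to the send stage; that part is omitted here.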

proxy¶

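In proxy mode, the current Dataway keeps no local tail sampling state:

/v1/tail_sampling and /v1/tail_sampling_v2 are forwarded to a backend node
/v1/tail_sampling_config is broadcast to all backend nodes

Therefore, in proxy mode:

aggregator_endpoint must be configured
Clients must send a valid Guance-Pick-Key
The backend nodes do the actual sampling and maintain the state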
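This page does not say how a Guance-Pick-Key is mapped to a backend. One common way to keep every request carrying the same key on the same node is to hash the key over the endpoint list. The sketch below illustrates that idea only; `pick_backend`, `broadcast_targets`, and the hashing scheme are assumptions, not Dataway's actual routing code.

```python
import hashlib


def pick_backend(pick_key: str, endpoints: list) -> str:
    """Map a pick key to one endpoint; the same key always yields the same node."""
    if not endpoints:
        raise ValueError("proxy mode requires at least one aggregator_endpoint")
    digest = hashlib.sha256(pick_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(endpoints)
    return endpoints[index]


def broadcast_targets(endpoints: list) -> list:
    """Configuration updates go to every backend, not just one."""
    return list(endpoints)


endpoints = ["http://dataway-0:9528", "http://dataway-1:9528"]
data_target = pick_backend("example-pick-key", endpoints)
config_targets = broadcast_targets(endpoints)
```

With a modulo scheme like this, changing the number or order of endpoints remaps keys, which is one reason stable backend addresses matter.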

Warning

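In a Kubernetes deployment, if the front Dataway must forward tail sampling requests reliably to fixed backend nodes, aggregator_endpoint must contain backend addresses that do not change. We recommend deploying the backend Dataway as a StatefulSet so that Pod addresses and DNS names stay stable and the front Dataway can forward to fixed targets.

Local configuration¶

Dataway has no dedicated tail sampling options in its local YAML; tail sampling uses the same mode configuration as aggregation: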

aggregator_mode: standalone
aggregator_endpoint:
  - http://dataway-0:9528
  - http://dataway-1:9528

Environment variables:

DW_AGGREGATOR_MODE=standalone
DW_AGGREGATOR_ENDPOINTS=http://dataway-0:9528,http://dataway-1:9528

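Notes:

standalone: the current node holds the tail sampling state itself
proxy: the current node only forwards or broadcasts

In Kubernetes, if a front Dataway acts as the entry layer and backend Dataways perform the actual tail sampling, the backend nodes are better deployed as a StatefulSet, with the stable addresses of the StatefulSet Pods written into aggregator_endpoint.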
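As an illustration of how the two environment variables relate to the YAML keys, the sketch below parses them into the same shape and enforces the rule that proxy mode needs at least one endpoint. The function name `load_aggregator_config` is hypothetical, and rejecting an unset or unknown mode is an assumption; Dataway's real defaults are not described here.

```python
def load_aggregator_config(env: dict) -> dict:
    """Parse DW_AGGREGATOR_MODE / DW_AGGREGATOR_ENDPOINTS from an env mapping."""
    mode = env.get("DW_AGGREGATOR_MODE", "").strip()
    if mode not in ("standalone", "proxy"):
        # Assumption: an unset or unknown mode is treated as an error.
        raise ValueError(f"unsupported aggregator mode: {mode!r}")

    raw = env.get("DW_AGGREGATOR_ENDPOINTS", "")
    # The variable holds a comma-separated list; ignore empty items.
    endpoints = [item.strip() for item in raw.split(",") if item.strip()]

    if mode == "proxy" and not endpoints:
        raise ValueError("proxy mode requires aggregator_endpoint")

    return {"aggregator_mode": mode, "aggregator_endpoint": endpoints}


cfg = load_aggregator_config({
    "DW_AGGREGATOR_MODE": "proxy",
    "DW_AGGREGATOR_ENDPOINTS": "http://dataway-0:9528, http://dataway-1:9528",
})
```

In practice the mapping passed in would be `os.environ`.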

Delivering the sampling configuration¶

Tail sampling rules are delivered through an API endpoint rather than written in dataway.yaml:

POST /v1/tail_sampling_config

The request body is JSON with the following top-level structure:

{
  "version": 1,
  "trace": {},
  "logging": {},
  "rum": {}
}

Where:

trace holds the tracing tail sampling configuration
logging holds the logging tail sampling configuration
rum holds the rum tail sampling configuration
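A minimal sketch of building such a request with Python's standard library follows. The base URL, the example token value, and passing the token as a `token` query parameter are assumptions made for illustration; check how your deployment authenticates before relying on them. The request is only constructed here, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_config_request(base_url: str, token: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/tail_sampling_config request."""
    # Assumption: the workspace token travels as a `token` query parameter.
    query = urllib.parse.urlencode({"token": token})
    url = f"{base_url}/v1/tail_sampling_config?{query}"
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


config = {
    "version": 1,
    "trace": {
        "version": 1,
        "group_key": "trace_id",
        "pipelines": [{"name": "keep-all", "type": "probabilistic", "rate": 1}],
    },
}
req = build_config_request("http://dataway-0:9528", "tkn_example", config)
# urllib.request.urlopen(req) would actually send it.
```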

tracing configuration example¶

{
  "version": 1,
  "trace": {
    "version": 1,
    "data_ttl": "5m",
    "group_key": "trace_id",
    "pipelines": [
      {
        "name": "keep-all",
        "type": "probabilistic",
        "rate": 1
      }
    ],
    "builtin_metrics": [
      {
        "name": "trace_total_count",
        "enabled": true
      }
    ]
  }
}

Notes:

trace.group_key can currently only be trace_id
trace.data_ttl defaults to 5m when empty
pipelines supports two types: condition and probabilistic
condition pipelines use action=keep/drop
probabilistic pipelines use rate=0~1

logging configuration example¶

{
  "version": 1,
  "logging": {
    "version": 1,
    "data_ttl": "1m",
    "group_dimensions": [
      {
        "group_key": "service",
        "pipelines": [
          {
            "name": "keep-all",
            "type": "probabilistic",
            "rate": 1
          }
        ]
      }
    ]
  }
}

rum configuration example¶

{
  "version": 1,
  "rum": {
    "version": 1,
    "data_ttl": "1m",
    "group_dimensions": [
      {
        "group_key": "session_id",
        "pipelines": [
          {
            "name": "keep-all",
            "type": "probabilistic",
            "rate": 1
          }
        ]
      }
    ]
  }
}

Info

logging and rum use group_dimensions to configure their grouping dimensions; for both, data_ttl defaults to 1m when empty.

Warning

The current implementation validates the configuration. trace only allows group_key=trace_id; derived_metrics is not supported yet, and configuring it returns an error.
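To catch mistakes before sending a configuration, a client can run a rough pre-check that mirrors the rules stated on this page. This is a sketch, not Dataway's validator: the function names are hypothetical, and the assumption that `derived_metrics` would appear as a key inside each data type section is a guess about where that field lives.

```python
def _check_pipelines(pipelines: list, where: str, errors: list) -> None:
    for p in pipelines:
        kind = p.get("type")
        if kind == "condition":
            if p.get("action") not in ("keep", "drop"):
                errors.append(f"{where}: condition action must be keep or drop")
        elif kind == "probabilistic":
            rate = p.get("rate")
            is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
            if not is_number or not 0 <= rate <= 1:
                errors.append(f"{where}: probabilistic rate must be within 0~1")
        else:
            errors.append(f"{where}: unknown pipeline type {kind!r}")


def validate_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the pre-check passed."""
    errors = []
    for section in ("trace", "logging", "rum"):
        body = cfg.get(section) or {}
        # Assumption: derived_metrics would sit inside each section.
        if body.get("derived_metrics"):
            errors.append(f"{section}: derived_metrics is not supported yet")

    trace = cfg.get("trace") or {}
    if trace:
        if trace.get("group_key") != "trace_id":
            errors.append("trace: group_key must be trace_id")
        _check_pipelines(trace.get("pipelines", []), "trace", errors)

    for section in ("logging", "rum"):
        for dim in (cfg.get(section) or {}).get("group_dimensions", []):
            where = f"{section}[{dim.get('group_key')}]"
            _check_pipelines(dim.get("pipelines", []), where, errors)
    return errors


problems = validate_config({
    "version": 1,
    "trace": {
        "group_key": "span_id",
        "pipelines": [{"name": "half", "type": "probabilistic", "rate": 1.5}],
    },
})
```

Here `problems` reports both the wrong group key and the out-of-range rate. The server remains the authority; a passing pre-check does not guarantee acceptance.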

Data reporting endpoints¶

Tail sampling data endpoints:

POST /v1/tail_sampling
POST /v1/tail_sampling_v2

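Notes:

Both endpoints currently go through the same processing logic
In standalone mode, the request body must be a protobuf-encoded aggregate.DataPacket
In proxy mode, the request is forwarded to a backend node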

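412 and the pending cache¶

In standalone mode, if Dataway has just started and the sampling configuration for the given token + data_type has not been delivered yet:

Dataway first puts the batch into a local pending cache
It then returns 412 Precondition Failed

Current behavior:

The pending cache is an in-memory cache
Data is held per token + data_type
Once the configuration is delivered successfully, the usable data is drained automatically into the TailSamplingProcessor
By default, at most 100000 packets are cached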
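The behavior above can be modeled with a bounded in-memory structure keyed by (token, data_type). The sketch below is illustrative only: `PendingCache` is a hypothetical name, and treating the 100000 limit as one global packet count (rather than a per-key limit) is an assumption. The 503 returned when the cache is full follows what this page documents for that case.

```python
from collections import defaultdict


class PendingCache:
    """Hypothetical model of a bounded pending cache keyed by (token, data_type)."""

    def __init__(self, max_packets: int = 100_000):
        self.max_packets = max_packets
        self.size = 0                      # total packets held across all keys
        self.buckets = defaultdict(list)   # (token, data_type) -> packets

    def add(self, token: str, data_type: str, packet) -> int:
        """Store a packet and return the HTTP status the caller would see."""
        if self.size >= self.max_packets:
            return 503                     # full: the packet was NOT accepted
        self.buckets[(token, data_type)].append(packet)
        self.size += 1
        return 412                         # held until the config arrives

    def drain(self, token: str, data_type: str) -> list:
        """Called after a config update: hand the held packets to the sampler."""
        packets = self.buckets.pop((token, data_type), [])
        self.size -= len(packets)
        return packets


cache = PendingCache(max_packets=2)
statuses = [cache.add("tkn_example", "tracing", p) for p in ("p1", "p2", "p3")]
drained = cache.drain("tkn_example", "tracing")
```

With a limit of 2, the third packet is rejected with 503, and draining frees the space again.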

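Client contract:

On receiving 412, the client treats the batch as already accepted by Dataway
The client only needs to go on to send /v1/tail_sampling_config
The client does not need to resend the batch

Exception:

If the pending cache is full, Dataway returns 503
In that case the request must not be treated as accepted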
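On the client side, the contract reduces to a small decision on the response status. The function below is a sketch with a hypothetical name and return shape; treating any 2xx status as accepted, and every other status besides 412 as not accepted, are assumptions beyond what this page states.

```python
def interpret_status(status: int) -> dict:
    """Decide what a client should do after POSTing a tail sampling batch."""
    if 200 <= status < 300:
        # Assumption: any 2xx means the batch was accepted normally.
        return {"accepted": True, "send_config": False}
    if status == 412:
        # Batch is held in the pending cache; only the config is still missing.
        return {"accepted": True, "send_config": True}
    # 503 (pending cache full) and anything else: the batch was not accepted.
    return {"accepted": False, "send_config": False}


after_412 = interpret_status(412)
after_503 = interpret_status(503)
```

What a client does with a batch that was not accepted (retry, buffer, or drop) is a policy choice left outside this sketch.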

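Built-in tail sampling metrics¶

The tail sampling configuration supports builtin_metrics. These metrics are generated by the tail sampling processor during sampling and are written to the center on each periodic flush.

The current built-in metrics are listed below.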

tracing¶

trace_total_count
trace_kept_count
trace_dropped_count
trace_error_count
span_total_count
trace_duration

Where:

trace_duration is a duration distribution metric
The others are count metrics

logging¶

logging_total_count
logging_error_count
logging_kept_count
logging_dropped_count

rum¶

rum_total_count
rum_kept_count
rum_dropped_count

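Notes:

When builtin_metrics is empty, all built-in metrics supported for that data type are currently enabled by default
These metrics come from the tail sampling process itself and are not Dataway's own runtime metrics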
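The default rule for builtin_metrics can be sketched as a small lookup. The metric names come from the lists above; the function name and the handling of a non-empty list (only entries with `enabled: true` are kept, unknown names are ignored) are assumptions.

```python
BUILTIN_METRICS = {
    "tracing": [
        "trace_total_count", "trace_kept_count", "trace_dropped_count",
        "trace_error_count", "span_total_count", "trace_duration",
    ],
    "logging": [
        "logging_total_count", "logging_error_count",
        "logging_kept_count", "logging_dropped_count",
    ],
    "rum": ["rum_total_count", "rum_kept_count", "rum_dropped_count"],
}


def enabled_metrics(data_type: str, builtin_metrics: list) -> list:
    """Resolve which built-in metrics are active for one data type."""
    supported = BUILTIN_METRICS[data_type]
    if not builtin_metrics:
        # Documented default: an empty list enables everything supported.
        return list(supported)
    # Assumption: otherwise only explicitly enabled, known names are active.
    wanted = {m["name"] for m in builtin_metrics if m.get("enabled")}
    return [name for name in supported if name in wanted]


all_rum = enabled_metrics("rum", [])
only_total = enabled_metrics("tracing", [{"name": "trace_total_count", "enabled": True}])
```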

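Metrics reported automatically by Dataway¶

Besides the sampler's own builtin_metrics, apis/metrics_special.go also automatically maintains a set of Dataway self-observability metrics that describe how the tail sampling API is handling requests.

The metrics currently related to tail sampling are: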

Metric	Type	Labels	Description
`dataway_http_api_body_size_bytes_total`	Counter	`api`, `token`	Cumulative request body bytes received on the tail sampling endpoints
`dataway_http_tail_sampling_trace_total`	Counter	`token`	Number of tracing groups received
`dataway_http_tail_sampling_span_total`	Counter	`token`	Total number of tracing spans received
`dataway_http_tail_sampling_packet_send_total`	Counter	`token`, `data_type`, `result`	Send result statistics; `result` is one of `success`, `failure`, `drop`

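These metrics are:

Collected once every 1 minute
Converted into dataway_aggregate metric points
Reported to /v1/write/metric using Dataway's default token
Reset to zero after reporting

This set of metrics reflects the runtime state of Dataway itself while it handles tail sampling traffic, not the business statistics of the sampling rules.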
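The collect, convert, report, reset cycle can be illustrated with a counter store that hands out a snapshot and starts over. This is a sketch under stated assumptions: the class name and the shape of the emitted points (a `measurement`/`field`/`tags`/`value` dictionary) are hypothetical and do not reproduce the code in apis/metrics_special.go.

```python
import threading


class ResettingCounters:
    """Accumulates labeled counters and resets them on every collection."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def add(self, name: str, labels: dict, amount: int = 1) -> None:
        key = (name, tuple(sorted(labels.items())))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def collect_and_reset(self) -> list:
        # Swap the map under the lock so increments are never lost or counted twice.
        with self._lock:
            snapshot, self._values = self._values, {}
        # Assumed point shape: one dataway_aggregate point per counter series.
        return [
            {"measurement": "dataway_aggregate", "field": name,
             "tags": dict(labels), "value": value}
            for (name, labels), value in snapshot.items()
        ]


counters = ResettingCounters()
counters.add("dataway_http_tail_sampling_span_total", {"token": "tkn_example"}, 40)
counters.add("dataway_http_tail_sampling_span_total", {"token": "tkn_example"}, 2)

points = counters.collect_and_reset()    # would run once per minute
second = counters.collect_and_reset()    # nothing accumulated since the reset
```

Because the values are reset after each report, every point carries the increase during the last interval rather than a lifetime total.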