Dataway Tail Sampling


Introduction

Dataway provides tail-sampling APIs:

  • /v1/tail_sampling
  • /v1/tail_sampling_v2
  • /v1/tail_sampling_config

With tail sampling, Dataway first receives the grouped data, applies the sampling rules, and then sends only the kept data upstream.

Three data types are currently supported:

  • tracing
  • logging
  • rum

The basic flow is:

sequenceDiagram
autonumber

participant dk as Datakit/Client
participant dw as Dataway
participant ts as TailSamplingProcessor
participant kodo as Kodo

dk ->> dw: POST /v1/tail_sampling
alt config ready
    dw ->> ts: ingest packet
    ts ->> dw: kept packets
    dw ->> kodo: write tracing/logging/rum
else config not ready
    dw ->> dw: pending cache
    dw -->> dk: 412 Precondition Failed
    dk ->> dw: POST /v1/tail_sampling_config
    dw ->> ts: update config and drain pending
end

Working Modes

Tail sampling uses the same mode settings as aggregate:

  • standalone
  • proxy

standalone

In standalone mode, the current Dataway handles tail-sampling data directly:

  • receive protobuf-encoded aggregate.DataPacket
  • look up sampling config by token + data_type
  • ingest the packet into TailSamplingProcessor when config is ready
  • periodically flush expired groups to the target write API

In the current implementation:

  • the sampling loop advances every 1 second
  • derived metrics are flushed every 1 minute
  • a worker pool is used for async delivery
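The periodic flush of expired groups can be pictured with the sketch below. It is an illustration under stated assumptions, not Dataway's code: `GroupStore`, `ingest`, `pop_expired`, and `run_loop` are hypothetical names, and the rule that a group's age is counted from its first packet is an assumption.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Group:
    first_seen: float                      # time the first packet of the group arrived
    packets: list = field(default_factory=list)


class GroupStore:
    """Hypothetical in-memory store of groups awaiting a sampling decision."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.groups = {}                   # group key -> Group

    def ingest(self, group_key, packet, now):
        group = self.groups.setdefault(group_key, Group(first_seen=now))
        group.packets.append(packet)

    def pop_expired(self, now):
        """Remove and return groups older than the TTL (assumed: age counts from the first packet)."""
        expired = {key: g.packets for key, g in self.groups.items()
                   if now - g.first_seen >= self.ttl_seconds}
        for key in expired:
            del self.groups[key]
        return expired


def run_loop(store, flush, ticks, interval=1.0):
    """Advance the loop `ticks` times, one tick per `interval` seconds."""
    for _ in range(ticks):
        expired = store.pop_expired(time.monotonic())
        if expired:
            flush(expired)                 # e.g. apply sampling rules, then write upstream
        time.sleep(interval)
```

In this sketch `flush` stands in for whatever applies the sampling rules and hands kept data to the worker pool; the 1-second default interval mirrors the loop period stated above.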

proxy

In proxy mode, the current Dataway does not keep local tail-sampling state:

  • /v1/tail_sampling and /v1/tail_sampling_v2 are forwarded to backend nodes
  • /v1/tail_sampling_config is broadcast to all backend nodes

That means in proxy mode:

  • aggregator_endpoint is required
  • the client must send a valid Guance-Pick-Key
  • backend nodes keep the actual sampling state

Warning

In Kubernetes, if the front Dataway needs to forward tail-sampling requests to fixed backend nodes, aggregator_endpoint must contain stable backend addresses. Backend Dataway nodes should be deployed with StatefulSet so each Pod keeps a stable address and DNS name for deterministic forwarding.
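To see why stable backend addresses matter, consider the sketch below. How Dataway actually maps a Guance-Pick-Key to a backend is not described on this page; the hash-modulo rule and the name `pick_backend` are assumptions used only for illustration.

```python
import hashlib


def pick_backend(pick_key, endpoints):
    """Map a pick key to one endpoint.

    Assumed rule: SHA-256 of the key, modulo the number of endpoints. The same
    key yields the same endpoint only while the endpoint list stays unchanged.
    """
    if not endpoints:
        raise ValueError("proxy mode requires at least one backend endpoint")
    digest = hashlib.sha256(pick_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(endpoints)
    return endpoints[index]
```

Under any rule of this kind, a backend whose address changes after a restart can cause packets of the same group to land on different nodes, which is what the StatefulSet recommendation is meant to prevent.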

Local Configuration

There is no separate local YAML section for tail sampling. Dataway uses the same mode settings as aggregate:

aggregator_mode: standalone
aggregator_endpoint:
  - http://dataway-0:9528
  - http://dataway-1:9528

Environment variables:

DW_AGGREGATOR_MODE=standalone
DW_AGGREGATOR_ENDPOINTS=http://dataway-0:9528,http://dataway-1:9528

Notes:

  • standalone: the current node keeps tail-sampling state locally
  • proxy: the current node only forwards or broadcasts

In Kubernetes, when a front Dataway acts as the ingress layer and backend Dataway nodes perform the actual tail sampling, deploy the backend nodes as a StatefulSet and list their stable Pod addresses in aggregator_endpoint.
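The two environment variables can be read and checked as in the following sketch. The function name `load_aggregator_settings` is hypothetical, and rejecting an unset or unknown mode is an assumption, since the default mode is not stated here.

```python
def load_aggregator_settings(env):
    """Read the two variables shown above from a mapping such as os.environ."""
    mode = env.get("DW_AGGREGATOR_MODE", "").strip()
    raw = env.get("DW_AGGREGATOR_ENDPOINTS", "")
    # The endpoint list is comma-separated; blank entries are ignored.
    endpoints = [item.strip() for item in raw.split(",") if item.strip()]

    if mode not in ("standalone", "proxy"):
        raise ValueError(f"unsupported aggregator mode: {mode!r}")
    if mode == "proxy" and not endpoints:
        raise ValueError("proxy mode requires at least one aggregator endpoint")
    return {"aggregator_mode": mode, "aggregator_endpoint": endpoints}
```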

Sampling Config Delivery

Sampling rules are delivered by API instead of being written in dataway.yaml:

POST /v1/tail_sampling_config

The request body is JSON with this top-level structure:

{
  "version": 1,
  "trace": {},
  "logging": {},
  "rum": {}
}

Where:

  • trace is the tracing tail-sampling config
  • logging is the logging tail-sampling config
  • rum is the RUM tail-sampling config

Tracing Example

{
  "version": 1,
  "trace": {
    "version": 1,
    "data_ttl": "5m",
    "group_key": "trace_id",
    "pipelines": [
      {
        "name": "keep-all",
        "type": "probabilistic",
        "rate": 1
      }
    ],
    "builtin_metrics": [
      {
        "name": "trace_total_count",
        "enabled": true
      }
    ]
  }
}

Notes:

  • trace.group_key currently supports only trace_id
  • trace.data_ttl defaults to 5m when empty
  • pipelines support two types: condition and probabilistic
  • a condition pipeline takes action=keep or action=drop
  • a probabilistic pipeline takes a rate between 0 and 1
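A client can check pipeline entries against these rules before sending the config. The sketch below is one way to do that; `validate_pipeline` and `build_trace_config` are hypothetical helper names, and treating both 0 and 1 as valid rates is an assumption.

```python
import json


def validate_pipeline(pipeline):
    """Check one pipeline entry against the rules listed above; return it unchanged."""
    kind = pipeline.get("type")
    if kind == "condition":
        if pipeline.get("action") not in ("keep", "drop"):
            raise ValueError("condition pipelines need action=keep or action=drop")
    elif kind == "probabilistic":
        rate = pipeline.get("rate")
        is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
        if not is_number or not 0 <= rate <= 1:
            raise ValueError("probabilistic pipelines need a rate between 0 and 1")
    else:
        raise ValueError(f"unsupported pipeline type: {kind!r}")
    return pipeline


def build_trace_config(pipelines, data_ttl="5m"):
    """Assemble a request body shaped like the tracing example."""
    return {
        "version": 1,
        "trace": {
            "version": 1,
            "data_ttl": data_ttl,
            "group_key": "trace_id",       # the only value currently supported
            "pipelines": [validate_pipeline(p) for p in pipelines],
        },
    }


body = json.dumps(build_trace_config(
    [{"name": "keep-all", "type": "probabilistic", "rate": 1}]
))
```

The resulting `body` string is what would be sent to POST /v1/tail_sampling_config.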

Logging Example

{
  "version": 1,
  "logging": {
    "version": 1,
    "data_ttl": "1m",
    "group_dimensions": [
      {
        "group_key": "service",
        "pipelines": [
          {
            "name": "keep-all",
            "type": "probabilistic",
            "rate": 1
          }
        ]
      }
    ]
  }
}

RUM Example

{
  "version": 1,
  "rum": {
    "version": 1,
    "data_ttl": "1m",
    "group_dimensions": [
      {
        "group_key": "session_id",
        "pipelines": [
          {
            "name": "keep-all",
            "type": "probabilistic",
            "rate": 1
          }
        ]
      }
    ]
  }
}

Info

logging and rum use group_dimensions for grouping. When data_ttl is empty, both default to 1m.

Warning

The current implementation validates the config: trace accepts only group_key=trace_id, and derived_metrics is not supported yet.
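The defaults and restrictions described in this section can be summarized in a small client-side check. This is a sketch, not Dataway's validator: `normalize_config` is a hypothetical name, the location of a `derived_metrics` key inside each section is assumed, and rejecting a non-empty trace section that lacks group_key is also an assumption.

```python
DEFAULT_TTL = {"trace": "5m", "logging": "1m", "rum": "1m"}


def normalize_config(config):
    """Return a copy of the config with default TTLs applied, or raise ValueError."""
    result = dict(config)
    for section, default_ttl in DEFAULT_TTL.items():
        if not config.get(section):        # absent or empty sections are left alone
            continue
        body = dict(config[section])
        if "derived_metrics" in body:      # assumed key location
            raise ValueError(f"{section}: derived_metrics is not supported yet")
        if section == "trace" and body.get("group_key") != "trace_id":
            raise ValueError("trace: group_key must be trace_id")
        if not body.get("data_ttl"):
            body["data_ttl"] = default_ttl
        result[section] = body
    return result
```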

Data APIs

Tail-sampling data APIs:

POST /v1/tail_sampling
POST /v1/tail_sampling_v2

Notes:

  • both APIs currently use the same handler
  • in standalone mode, the request body must be protobuf-encoded aggregate.DataPacket
  • in proxy mode, the request is forwarded to a backend node

412 and Pending Cache

In standalone mode, if Dataway has just started and the sampling config for the current token + data_type has not arrived yet:

  • Dataway stores the packet in the local pending cache first
  • then returns 412 Precondition Failed

Current behavior:

  • the pending cache is in memory
  • packets are grouped by token + data_type
  • after config delivery succeeds, matching packets are drained into TailSamplingProcessor
  • the current default limit is 100000 packets
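The behavior listed above can be modeled roughly as follows. `PendingCache` and its methods are hypothetical names, and counting the limit across all token + data_type keys together (rather than per key) is an assumption.

```python
from collections import defaultdict


class PendingCache:
    """Hypothetical in-memory cache for packets that arrive before their config."""

    def __init__(self, limit=100000):
        self.limit = limit
        self.count = 0
        self.packets = defaultdict(list)   # (token, data_type) -> [packet, ...]

    def add(self, token, data_type, packet):
        """Store a packet; return False when the cache is full."""
        if self.count >= self.limit:
            return False
        self.packets[(token, data_type)].append(packet)
        self.count += 1
        return True

    def drain(self, token, data_type):
        """Remove and return every pending packet for one token + data_type."""
        drained = self.packets.pop((token, data_type), [])
        self.count -= len(drained)
        return drained
```

In this model, `drain` is what runs after a config for the matching token + data_type is delivered, and its result is fed into the sampling processor.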

Expected client behavior:

  • when the client receives 412, it should treat the packet as already accepted by Dataway
  • the client should then deliver the sampling config through /v1/tail_sampling_config
  • the client should not resend the same data packet

Failure case:

  • if the pending cache is full, Dataway returns 503
  • in that case the request must not be treated as accepted
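On the client side, the status handling described above can be reduced to a small decision function. The names `Outcome` and `interpret_status` are hypothetical, and treating every status other than 2xx and 412 as "not accepted" is an assumption that goes beyond the 503 case documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    accepted: bool       # Dataway holds the packet; do not resend it
    send_config: bool    # the sampling config still has to be delivered


def interpret_status(status):
    """Map an HTTP status from /v1/tail_sampling to the client's next step."""
    if status == 412:
        # Packet sits in the pending cache; deliver the config, do not resend.
        return Outcome(accepted=True, send_config=True)
    if 200 <= status < 300:
        return Outcome(accepted=True, send_config=False)
    # 503 (pending cache full) and, by assumption, every other status.
    return Outcome(accepted=False, send_config=False)
```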

Built-in Sampling Metrics

Tail-sampling config supports builtin_metrics. These metrics are generated by the tail-sampling processor itself and flushed upstream periodically.

The current built-in metrics are:

tracing

  • trace_total_count
  • trace_kept_count
  • trace_dropped_count
  • trace_error_count
  • span_total_count
  • trace_duration

Where:

  • trace_duration is a duration-distribution metric
  • the others are count metrics

logging

  • logging_total_count
  • logging_error_count
  • logging_kept_count
  • logging_dropped_count

rum

  • rum_total_count
  • rum_kept_count
  • rum_dropped_count

Notes:

  • when builtin_metrics is empty, all supported built-in metrics for that data type are enabled by default
  • these metrics come from the sampling process itself, not from Dataway runtime observation
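The default rule for builtin_metrics can be expressed as a short lookup. This is a sketch: `enabled_builtin_metrics` is a hypothetical name, and the handling of a non-empty list (only entries with enabled set to true are kept, unknown names are ignored) is an assumption.

```python
SUPPORTED_BUILTIN_METRICS = {
    "tracing": ["trace_total_count", "trace_kept_count", "trace_dropped_count",
                "trace_error_count", "span_total_count", "trace_duration"],
    "logging": ["logging_total_count", "logging_error_count",
                "logging_kept_count", "logging_dropped_count"],
    "rum": ["rum_total_count", "rum_kept_count", "rum_dropped_count"],
}


def enabled_builtin_metrics(data_type, builtin_metrics):
    """Return the metric names that end up enabled for one data type."""
    supported = SUPPORTED_BUILTIN_METRICS[data_type]
    if not builtin_metrics:
        # Empty or missing list: everything supported is enabled by default.
        return list(supported)
    return [m["name"] for m in builtin_metrics
            if m.get("enabled") and m.get("name") in supported]
```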

Automatic Dataway Metrics

In addition to the sampling builtin_metrics, Dataway also tracks its own runtime metrics for the tail-sampling APIs (implemented in apis/metrics_special.go).

The current tail-sampling-related metrics are:

Metric | Type | Labels | Description
dataway_http_api_body_size_bytes_total | Counter | api, token | accumulated request body bytes for tail-sampling APIs
dataway_http_tail_sampling_trace_total | Counter | token | received tracing group count
dataway_http_tail_sampling_span_total | Counter | token | received tracing span count
dataway_http_tail_sampling_packet_send_total | Counter | token, data_type, result | send result count, where result is one of success, failure, or drop

These metrics are:

  • gathered every 1 minute
  • converted into dataway_aggregate points
  • sent to /v1/write/metric with the default Dataway token
  • reset after a reporting round

These metrics describe Dataway's own tail-sampling runtime behavior rather than the business-level sampling outcome.
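The gather-then-reset cycle can be illustrated with a minimal counter registry. This is not the code in apis/metrics_special.go: `RuntimeCounters` is a hypothetical class, the token value in the usage lines is a placeholder, and the sketch is not thread-safe.

```python
from collections import Counter


class RuntimeCounters:
    """Hypothetical registry of labeled counters that is cleared on every report."""

    def __init__(self):
        self.values = Counter()            # (metric name, sorted label items) -> count

    def add(self, name, amount=1, **labels):
        self.values[(name, tuple(sorted(labels.items())))] += amount

    def gather_and_reset(self):
        """Return the values accumulated since the previous round and start a new one."""
        snapshot, self.values = self.values, Counter()
        return [{"name": name, "labels": dict(labels), "value": value}
                for (name, labels), value in snapshot.items()]


counters = RuntimeCounters()
counters.add("dataway_http_tail_sampling_span_total", 3, token="tkn_example")
counters.add("dataway_http_tail_sampling_span_total", 2, token="tkn_example")
points = counters.gather_and_reset()       # one point with value 5
```

Because the counters are reset after each round, every reported point covers only the most recent reporting interval rather than a running total.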
