Dataway Tail Sampling¶
Introduction¶
Dataway provides tail-sampling APIs:
- /v1/tail_sampling
- /v1/tail_sampling_v2
- /v1/tail_sampling_config
Dataway first receives grouped data, applies the sampling rules, and sends the kept data upstream.
Three data types are currently supported:
- tracing
- logging
- rum
The basic flow is:
sequenceDiagram
    autonumber
    participant dk as Datakit/Client
    participant dw as Dataway
    participant ts as TailSamplingProcessor
    participant kodo as Kodo
    dk ->> dw: POST /v1/tail_sampling
    alt config ready
        dw ->> ts: ingest packet
        ts ->> dw: kept packets
        dw ->> kodo: write tracing/logging/rum
    else config not ready
        dw ->> dw: pending cache
        dw -->> dk: 412 Precondition Failed
        dk ->> dw: POST /v1/tail_sampling_config
        dw ->> ts: update config and drain pending
    end
Working Modes¶
Tail sampling uses the same mode settings as aggregate:
- standalone
- proxy
standalone¶
In standalone mode, the current Dataway handles tail-sampling data directly:
- receive protobuf-encoded aggregate.DataPacket
- look up sampling config by token + data_type
- ingest the packet into TailSamplingProcessor when config is ready
- periodically flush expired groups to the target write API
In the current implementation:
- the sampling loop advances every 1 second
- derived metrics are flushed every 1 minute
- a worker pool is used for async delivery
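The tick-and-flush behavior can be pictured with a small sketch. Everything in it is hypothetical: the class `GroupBuffer`, its method names, and the choice to measure the TTL from a group's first packet are assumptions made for illustration, not Dataway's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class GroupBuffer:
    """Hypothetical buffer: a group expires ttl_seconds after its first packet."""

    ttl_seconds: float
    # group key -> (time the first packet arrived, packets collected so far)
    groups: dict = field(default_factory=dict)

    def ingest(self, key, packet, now):
        """Add a packet to its group, creating the group on first sight."""
        _first_seen, packets = self.groups.setdefault(key, (now, []))
        packets.append(packet)

    def tick(self, now):
        """One loop step: remove and return every group whose TTL has elapsed."""
        expired = [
            key
            for key, (first_seen, _packets) in self.groups.items()
            if now - first_seen >= self.ttl_seconds
        ]
        return {key: self.groups.pop(key)[1] for key in expired}
```

A real loop would call `tick` once per second and hand the returned groups to the sampling rules; passing `now` explicitly keeps the sketch easy to test.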
proxy¶
In proxy mode, the current Dataway does not keep local tail-sampling state:
- /v1/tail_sampling and /v1/tail_sampling_v2 are forwarded to backend nodes
- /v1/tail_sampling_config is broadcast to all backend nodes
That means in proxy mode:
- aggregator_endpoint is required
- the client must send a valid Guance-Pick-Key
- backend nodes keep the actual sampling state
Warning
In Kubernetes, if the front Dataway needs to forward tail-sampling requests to fixed backend nodes, aggregator_endpoint must contain stable backend addresses. Backend Dataway nodes should be deployed with StatefulSet so each Pod keeps a stable address and DNS name for deterministic forwarding.
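To see why stable backend addresses matter, consider a minimal sketch of key-based routing. How Dataway actually maps a Guance-Pick-Key to a backend is not described here; the hash-modulo scheme and the function name `pick_backend` below are assumptions chosen only to illustrate the idea.

```python
import hashlib


def pick_backend(pick_key: str, endpoints: list[str]) -> str:
    """Map a pick key to one backend (assumed scheme: SHA-256 hash modulo list size)."""
    if not endpoints:
        raise ValueError("aggregator_endpoint must list at least one backend")
    digest = hashlib.sha256(pick_key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then wrap to the list size.
    index = int.from_bytes(digest[:8], "big") % len(endpoints)
    return endpoints[index]
```

With any scheme of this kind, the same key reaches the same node only while the endpoint list stays the same. If Pod addresses change, packets of one group can be split across nodes and sampled inconsistently.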
Local Configuration¶
There is no separate local YAML section for tail sampling. Dataway uses the same mode settings as aggregate:
Environment variables:
Notes:
- standalone: the current node keeps tail-sampling state locally
- proxy: the current node only forwards or broadcasts
In Kubernetes, when a front Dataway acts as the ingress layer and backend Dataway nodes perform the actual tail sampling, StatefulSet is the better backend deployment model and its stable Pod addresses should be used in aggregator_endpoint.
Sampling Config Delivery¶
Sampling rules are delivered by API instead of being written in dataway.yaml:
- POST /v1/tail_sampling_config
The request body is JSON. Besides version, its top level can carry one section per data type:
- trace is the tracing tail-sampling config
- logging is the logging tail-sampling config
- rum is the RUM tail-sampling config
Tracing Example¶
{
  "version": 1,
  "trace": {
    "version": 1,
    "data_ttl": "5m",
    "group_key": "trace_id",
    "pipelines": [
      {
        "name": "keep-all",
        "type": "probabilistic",
        "rate": 1
      }
    ],
    "builtin_metrics": [
      {
        "name": "trace_total_count",
        "enabled": true
      }
    ]
  }
}
Notes:
- trace.group_key currently only supports trace_id
- trace.data_ttl defaults to 5m when empty
- pipelines support condition and probabilistic
- condition uses action=keep/drop
- probabilistic uses rate=0~1
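The constraints in these notes can be checked on the client before the config is sent. The helper below is a sketch, not Dataway's validator: the function name `validate_trace_config` and the error messages are hypothetical, and rejecting a missing group_key is an assumption.

```python
def validate_trace_config(trace: dict) -> dict:
    """Return a copy of the trace section with defaults applied, or raise ValueError."""
    cfg = dict(trace)
    # Assumption: a missing group_key is rejected rather than defaulted.
    if cfg.get("group_key") != "trace_id":
        raise ValueError("trace.group_key currently only supports trace_id")
    if not cfg.get("data_ttl"):
        cfg["data_ttl"] = "5m"  # documented default for tracing
    for pipeline in cfg.get("pipelines", []):
        kind = pipeline.get("type")
        if kind == "probabilistic":
            rate = pipeline.get("rate")
            is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
            if not is_number or not 0 <= rate <= 1:
                raise ValueError("probabilistic pipelines need a rate between 0 and 1")
        elif kind == "condition":
            if pipeline.get("action") not in ("keep", "drop"):
                raise ValueError("condition pipelines need action=keep or action=drop")
        else:
            raise ValueError(f"unsupported pipeline type: {kind!r}")
    return cfg
```

Running it on the trace section of the example above returns the section unchanged, because data_ttl is already set and the only pipeline is a valid probabilistic one.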
Logging Example¶
{
  "version": 1,
  "logging": {
    "version": 1,
    "data_ttl": "1m",
    "group_dimensions": [
      {
        "group_key": "service",
        "pipelines": [
          {
            "name": "keep-all",
            "type": "probabilistic",
            "rate": 1
          }
        ]
      }
    ]
  }
}
RUM Example¶
{
  "version": 1,
  "rum": {
    "version": 1,
    "data_ttl": "1m",
    "group_dimensions": [
      {
        "group_key": "session_id",
        "pipelines": [
          {
            "name": "keep-all",
            "type": "probabilistic",
            "rate": 1
          }
        ]
      }
    ]
  }
}
Info
logging and rum use group_dimensions for grouping. When data_ttl is empty, both default to 1m.
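The default TTLs of the three config sections can be summarized in one helper. This is a sketch under stated assumptions: the name `effective_ttl_seconds` is hypothetical, and the parser only understands a single number followed by s, m, or h, which may be narrower than what Dataway accepts.

```python
# Documented defaults when data_ttl is empty.
DEFAULT_TTL = {"trace": "5m", "logging": "1m", "rum": "1m"}
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def effective_ttl_seconds(section: str, data_ttl: str = "") -> int:
    """Resolve data_ttl for a config section, falling back to its default."""
    text = data_ttl or DEFAULT_TTL[section]
    value, unit = text[:-1], text[-1]
    if unit not in UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(value) * UNIT_SECONDS[unit]
```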
Warning
The current implementation validates the config. trace only allows group_key=trace_id, and derived_metrics is not supported yet.
Data APIs¶
Tail-sampling data APIs:
- /v1/tail_sampling
- /v1/tail_sampling_v2
Notes:
- both APIs currently use the same handler
- in standalone mode, the request body must be protobuf-encoded aggregate.DataPacket
- in proxy mode, the request is forwarded to a backend node
412 And Pending Cache¶
In standalone mode, if Dataway has just started and the sampling config for the current token + data_type has not arrived yet:
- Dataway stores the packet in the local pending cache first
- then returns 412 Precondition Failed
Current behavior:
- the pending cache is in memory
- packets are grouped by token + data_type
- after config delivery succeeds, matching packets are drained into TailSamplingProcessor
- the current default limit is 100000 packets
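A minimal in-memory model of this cache might look as follows. The class `PendingCache` and its methods are hypothetical, and treating the limit as one global count across all keys is an assumption; the text above does not say whether the limit is global or per key.

```python
from collections import defaultdict


class PendingCache:
    """Hypothetical pending cache keyed by (token, data_type)."""

    def __init__(self, limit: int = 100_000):
        self.limit = limit
        self.size = 0
        self.by_key = defaultdict(list)

    def add(self, token: str, data_type: str, packet) -> bool:
        """Store a packet; False means the cache is full and nothing was stored."""
        if self.size >= self.limit:
            return False
        self.by_key[(token, data_type)].append(packet)
        self.size += 1
        return True

    def drain(self, token: str, data_type: str) -> list:
        """Remove and return all packets waiting for this token + data_type."""
        packets = self.by_key.pop((token, data_type), [])
        self.size -= len(packets)
        return packets
```

In this model, `drain` would be called once the config for a token + data_type arrives, and its result fed to the processor in arrival order.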
Expected client behavior:
- when the client receives 412, it should treat the packet as already accepted by Dataway
- the client should continue by sending its sampling config to /v1/tail_sampling_config
- the client should not resend the same data packet
Failure case:
- if the pending cache is full, Dataway returns 503
- in that case the request must not be treated as accepted
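The client-side rules above reduce to a small decision on the HTTP status code. The sketch below is illustrative: the function name `classify_response` and the returned field names are hypothetical, and treating every other non-2xx status as not accepted is an assumption.

```python
def classify_response(status: int) -> dict:
    """Decide how a client should treat the reply to a tail-sampling data request."""
    if 200 <= status < 300:
        # Config was ready and the packet was ingested.
        return {"accepted": True, "send_config": False}
    if status == 412:
        # Packet sits in the pending cache: do not resend it, deliver the config.
        return {"accepted": True, "send_config": True}
    # 503 (pending cache full) and, by assumption, any other status:
    # the packet was not accepted.
    return {"accepted": False, "send_config": False}
```

What the client does with an unaccepted packet (retry, buffer, or drop) is a client policy that this page does not specify.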
Built-in Sampling Metrics¶
Tail-sampling config supports builtin_metrics. These metrics are generated by the tail-sampling processor itself and flushed upstream periodically.
The current built-in metrics are:
tracing¶
- trace_total_count
- trace_kept_count
- trace_dropped_count
- trace_error_count
- span_total_count
- trace_duration
Where:
- trace_duration is a duration-distribution metric
- the others are count metrics
logging¶
- logging_total_count
- logging_error_count
- logging_kept_count
- logging_dropped_count
rum¶
- rum_total_count
- rum_kept_count
- rum_dropped_count
Notes:
- when builtin_metrics is empty, all supported built-in metrics for that data type are enabled by default
- these metrics come from the sampling process itself, not from Dataway runtime observation
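The default-enable rule can be expressed as a lookup. This is a sketch: the function name `enabled_metrics` is hypothetical, and the behavior for a non-empty list (only entries with enabled set to true are kept) is an assumption, since only the empty case is stated above.

```python
SUPPORTED_METRICS = {
    "tracing": [
        "trace_total_count", "trace_kept_count", "trace_dropped_count",
        "trace_error_count", "span_total_count", "trace_duration",
    ],
    "logging": [
        "logging_total_count", "logging_error_count",
        "logging_kept_count", "logging_dropped_count",
    ],
    "rum": ["rum_total_count", "rum_kept_count", "rum_dropped_count"],
}


def enabled_metrics(data_type: str, builtin_metrics=None) -> list:
    """Return the built-in metric names that would be active for a data type."""
    supported = SUPPORTED_METRICS[data_type]
    if not builtin_metrics:
        # Documented rule: an empty list enables every supported metric.
        return list(supported)
    # Assumption: otherwise only explicitly enabled, supported names are active.
    wanted = {m["name"] for m in builtin_metrics if m.get("enabled")}
    return [name for name in supported if name in wanted]
```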
Automatic Dataway Metrics¶
Besides sampling builtin_metrics, apis/metrics_special.go also tracks Dataway runtime metrics for the tail-sampling APIs.
The current tail-sampling-related metrics are:
| Metric | Type | Labels | Description |
|---|---|---|---|
| dataway_http_api_body_size_bytes_total | Counter | api, token | accumulated request body bytes for tail-sampling APIs |
| dataway_http_tail_sampling_trace_total | Counter | token | received tracing group count |
| dataway_http_tail_sampling_span_total | Counter | token | received tracing span count |
| dataway_http_tail_sampling_packet_send_total | Counter | token, data_type, result | send result count, where result includes success, failure, and drop |
These metrics are:
- gathered every 1 minute
- converted into dataway_aggregate points
- sent to /v1/write/metric with the default Dataway token
- reset after each reporting round
These metrics describe Dataway's own tail-sampling runtime behavior rather than the business-level sampling outcome.
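The gather-then-reset cycle described above can be modeled with a counter set that is swapped out on every report. The class `ResettingCounters` is a hypothetical illustration, not the code in apis/metrics_special.go, and it ignores concurrency for brevity.

```python
from collections import Counter


class ResettingCounters:
    """Hypothetical counters that start from zero after each reporting round."""

    def __init__(self):
        self._counts = Counter()

    def add(self, metric: str, labels: dict, value: int = 1) -> None:
        """Accumulate a value under a metric name and a set of labels."""
        self._counts[(metric, tuple(sorted(labels.items())))] += value

    def gather_and_reset(self) -> list:
        """Return the accumulated points and begin a fresh round."""
        snapshot, self._counts = self._counts, Counter()
        return [
            {"metric": metric, "labels": dict(labels), "value": value}
            for (metric, labels), value in snapshot.items()
        ]
```

Because each round starts at zero, every reported point carries the increase over the last reporting interval rather than a lifetime total.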