Dataway Tail Sampling¶

Features¶

Dataway provides tail sampling capability, with external interfaces including:

/v1/tail_sampling
/v1/tail_sampling_v2
/v1/tail_sampling_config

With tail sampling, Dataway first receives data and groups and packages it locally, then decides whether to keep or discard each group based on the sampling rules, and finally writes the retained data to the center.

Currently, three types of data are supported:

tracing
logging
rum

The basic processing flow is as follows:

sequenceDiagram
autonumber

participant dk as Datakit/Client
participant dw as Dataway
participant ts as TailSamplingProcessor
participant kodo as Kodo

dk ->> dw: POST /v1/tail_sampling
alt config ready
    dw ->> ts: ingest packet
    ts ->> dw: kept packets
    dw ->> kodo: write tracing/logging/rum
else config not ready
    dw ->> dw: pending cache
    dw -->> dk: 412 Precondition Failed
    dk ->> dw: POST /v1/tail_sampling_config
    dw ->> ts: update config and drain pending
end

Working Modes¶

Tail sampling and aggregation share the same set of mode configurations:

standalone
proxy

standalone¶

In standalone mode, the current Dataway directly processes tail sampling data:

Receives protobuf-encoded aggregate.DataPacket
Looks up tail sampling configuration based on token + data_type
When the configuration is ready, writes directly to TailSamplingProcessor
Periodically retrieves expired groups and sends them to the corresponding data type write interface

In the current implementation:

Sampling window advancement period is 1 second
Derived metric refresh period is 1 minute
The sending phase uses a worker pool for asynchronous writes
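The grouping and expiry steps above can be sketched as follows. This is a minimal illustration, not Dataway's implementation: the class `TailSamplingSketch`, its methods, and the choice to start a group's TTL at its first packet are all assumptions made for the example. Time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Packets that share one group key (for tracing, one trace_id)."""
    packets: list = field(default_factory=list)
    expires_at: float = 0.0


class TailSamplingSketch:
    """Hypothetical stand-in for TailSamplingProcessor."""

    def __init__(self, data_ttl_seconds: float):
        self.data_ttl = data_ttl_seconds
        self.groups: dict = {}

    def ingest(self, group_key: str, packet: bytes, now: float) -> None:
        # Assumption: the TTL starts when the first packet of a group arrives.
        group = self.groups.get(group_key)
        if group is None:
            group = Group(expires_at=now + self.data_ttl)
            self.groups[group_key] = group
        group.packets.append(packet)

    def advance(self, now: float) -> dict:
        """Remove and return every group whose TTL has elapsed."""
        expired = {
            key: group.packets
            for key, group in self.groups.items()
            if group.expires_at <= now
        }
        for key in expired:
            del self.groups[key]
        return expired
```

In this sketch a caller would invoke `advance` once per window period (1 second in the current implementation) and hand the returned groups to the sampling rules and then to the write interface.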

proxy¶

In proxy mode, the current Dataway does not retain local tail sampling state:

/v1/tail_sampling and /v1/tail_sampling_v2 are forwarded to backend nodes
/v1/tail_sampling_config is broadcast to all backend nodes

Therefore, in proxy mode:

aggregator_endpoint must be configured
Clients need to carry a valid Guance-Pick-Key
Backend nodes are responsible for the actual sampling and state maintenance

Warning

In Kubernetes deployments, if the frontend Dataway must forward tail sampling requests to fixed backend nodes, aggregator_endpoint must contain stable, unchanging backend addresses. Deploying the backend Dataway as a StatefulSet is recommended, because each Pod then keeps a stable DNS name that the frontend Dataway can forward to.
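The difference between forwarding and broadcasting in proxy mode can be sketched as below. The source does not describe how Guance-Pick-Key is mapped to a backend node, so the hash-and-modulo selection here is purely an assumption for illustration, as are the function names `pick_backend` and `route`. The sketch also shows why stable backend addresses matter: the same key only lands on the same node while the endpoint list stays unchanged.

```python
import hashlib
from typing import List, Optional


def pick_backend(pick_key: str, endpoints: List[str]) -> str:
    """Map a pick key to one backend (assumed strategy: hash modulo)."""
    if not endpoints:
        raise ValueError("aggregator_endpoint must be configured in proxy mode")
    digest = hashlib.sha256(pick_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(endpoints)
    return endpoints[index]


def route(path: str, pick_key: Optional[str], endpoints: List[str]) -> List[str]:
    """Return the backend addresses a proxy-mode request would be sent to."""
    if path == "/v1/tail_sampling_config":
        # Configuration is broadcast to every backend node.
        return list(endpoints)
    if path in ("/v1/tail_sampling", "/v1/tail_sampling_v2"):
        if not pick_key:
            raise ValueError("Guance-Pick-Key is required in proxy mode")
        # Data is forwarded to a single backend node.
        return [pick_backend(pick_key, endpoints)]
    raise ValueError(f"not a tail sampling path: {path}")
```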

Local Configuration¶

Dataway does not have separate tail sampling YAML configuration items locally; tail sampling uses the same mode configuration as aggregation:

aggregator_mode: standalone
aggregator_endpoint:
  - http://dataway-0:9528
  - http://dataway-1:9528

Environment variables:

DW_AGGREGATOR_MODE=standalone
DW_AGGREGATOR_ENDPOINTS=http://dataway-0:9528,http://dataway-1:9528

Explanation:

standalone: The current node holds the tail sampling state itself
proxy: The current node only performs forwarding or broadcasting
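As an illustration of how these two environment variables fit together, the following sketch parses them and enforces the rule that proxy mode needs at least one endpoint. The function name `load_aggregator_settings` is hypothetical, and requiring an explicit mode is an assumption: the source does not state what Dataway does when the mode is unset.

```python
import os
from typing import List, Mapping, Optional, Tuple

VALID_MODES = ("standalone", "proxy")


def load_aggregator_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, List[str]]:
    """Read the aggregator mode and endpoint list from environment variables."""
    env = os.environ if env is None else env
    mode = env.get("DW_AGGREGATOR_MODE", "").strip()
    raw = env.get("DW_AGGREGATOR_ENDPOINTS", "")
    # The endpoints variable is a comma-separated list of backend URLs.
    endpoints = [item.strip() for item in raw.split(",") if item.strip()]

    if mode not in VALID_MODES:
        raise ValueError(f"unsupported aggregator mode: {mode!r}")
    if mode == "proxy" and not endpoints:
        raise ValueError("proxy mode requires at least one aggregator endpoint")
    return mode, endpoints
```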

In Kubernetes, when a frontend Dataway serves as the entry layer and backend Dataway nodes perform the actual tail sampling, deploy the backend nodes as a StatefulSet and write the stable addresses of its Pods into aggregator_endpoint.

Sampling Configuration Distribution¶

Tail sampling rules are distributed via an interface, not written in dataway.yaml:

POST /v1/tail_sampling_config

The request body is JSON, with the top-level structure as follows:

{
  "version": 1,
  "trace": {},
  "logging": {},
  "rum": {}
}

Where:

trace corresponds to tracing tail sampling configuration
logging corresponds to logging tail sampling configuration
rum corresponds to rum tail sampling configuration

tracing Configuration Example¶

{
  "version": 1,
  "trace": {
    "version": 1,
    "data_ttl": "5m",
    "group_key": "trace_id",
    "pipelines": [
      {
        "name": "keep-all",
        "type": "probabilistic",
        "rate": 1
      }
    ],
    "builtin_metrics": [
      {
        "name": "trace_total_count",
        "enabled": true
      }
    ]
  }
}

Explanation:

trace.group_key can currently only be trace_id
trace.data_ttl defaults to 5m when empty
pipelines supports condition and probabilistic
condition uses action=keep/drop
probabilistic uses rate=0~1
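The rules listed above can be checked on the client side before a configuration is sent. The sketch below is an approximation of those rules, not Dataway's validator: the function name `normalize_trace_config` is hypothetical, and treating a missing group_key as an error is an assumption.

```python
def normalize_trace_config(trace: dict) -> dict:
    """Apply the documented default and reject values the rules do not allow."""
    cfg = dict(trace)

    # trace.group_key can currently only be trace_id.
    if cfg.get("group_key") != "trace_id":
        raise ValueError("trace.group_key must be trace_id")

    # trace.data_ttl defaults to 5m when empty.
    if not cfg.get("data_ttl"):
        cfg["data_ttl"] = "5m"

    for pipeline in cfg.get("pipelines", []):
        kind = pipeline.get("type")
        if kind == "condition":
            if pipeline.get("action") not in ("keep", "drop"):
                raise ValueError("condition pipelines need action=keep or drop")
        elif kind == "probabilistic":
            rate = pipeline.get("rate")
            is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
            if not is_number or not 0 <= rate <= 1:
                raise ValueError("probabilistic pipelines need a rate in [0, 1]")
        else:
            raise ValueError(f"unsupported pipeline type: {kind!r}")
    return cfg
```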

logging Configuration Example¶

{
  "version": 1,
  "logging": {
    "version": 1,
    "data_ttl": "1m",
    "group_dimensions": [
      {
        "group_key": "service",
        "pipelines": [
          {
            "name": "keep-all",
            "type": "probabilistic",
            "rate": 1
          }
        ]
      }
    ]
  }
}

rum Configuration Example¶

{
  "version": 1,
  "rum": {
    "version": 1,
    "data_ttl": "1m",
    "group_dimensions": [
      {
        "group_key": "session_id",
        "pipelines": [
          {
            "name": "keep-all",
            "type": "probabilistic",
            "rate": 1
          }
        ]
      }
    ]
  }
}

Info

logging and rum use group_dimensions to configure grouping dimensions; data_ttl defaults to 1m for both when empty.

Warning

The current implementation validates configuration content. trace only allows group_key=trace_id; derived_metrics is not currently supported, and configuring it will return an error.
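To make the defaults and the derived_metrics restriction concrete, here is a small sketch that computes the effective data_ttl for each section of a request body. The function name `effective_ttls` is hypothetical, and the assumption that derived_metrics would appear as a key inside a section is made only for illustration; the source does not show where that field lives.

```python
# Defaults taken from this page: 5m for trace, 1m for logging and rum.
DEFAULT_TTL = {"trace": "5m", "logging": "1m", "rum": "1m"}


def effective_ttls(body: dict) -> dict:
    """Return the data_ttl each configured section would end up with."""
    result = {}
    for section, default_ttl in DEFAULT_TTL.items():
        cfg = body.get(section)
        if not cfg:
            continue  # section not configured in this request
        # Assumed location of the unsupported field: inside the section.
        if cfg.get("derived_metrics"):
            raise ValueError(f"{section}.derived_metrics is not supported")
        result[section] = cfg.get("data_ttl") or default_ttl
    return result
```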

Data Reporting Interface¶

Tail sampling data interfaces:

POST /v1/tail_sampling
POST /v1/tail_sampling_v2

Explanation:

Both interfaces currently go through the same processing logic
In standalone mode, the request body needs to be a protobuf-encoded aggregate.DataPacket
In proxy mode, requests are forwarded to backend nodes

412 and pending cache¶

In standalone mode, if Dataway has just started and the sampling configuration for the corresponding token + data_type has not been distributed yet:

Dataway will first put this batch of data into the local pending cache
Then return 412 Precondition Failed

Current behavior:

pending cache is an in-memory cache
Temporarily stores data by token + data_type
After successful configuration distribution, automatically drains available data to TailSamplingProcessor
Currently defaults to caching a maximum of 100000 packets
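The pending cache behavior can be sketched as a bounded in-memory buffer keyed by token + data_type. This is an illustrative model only: the class name `PendingCache` and its methods are hypothetical, and applying the 100000-packet limit across all keys rather than per key is an assumption.

```python
from collections import defaultdict


class PendingCache:
    """Holds packets that arrived before their sampling config was distributed."""

    def __init__(self, max_packets: int = 100_000):
        self.max_packets = max_packets
        self.size = 0
        self.buckets = defaultdict(list)

    def put(self, token: str, data_type: str, packet: bytes) -> int:
        """Store a packet and return the HTTP status the caller would see."""
        if self.size >= self.max_packets:
            return 503  # cache full: the packet is not accepted
        self.buckets[(token, data_type)].append(packet)
        self.size += 1
        return 412  # accepted, but the configuration is still missing

    def drain(self, token: str, data_type: str) -> list:
        """Called after the config arrives; returns packets for the processor."""
        packets = self.buckets.pop((token, data_type), [])
        self.size -= len(packets)
        return packets
```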

Agreed behavior:

After receiving 412, the client should treat this batch of data as accepted by Dataway
The client only needs to follow up by sending /v1/tail_sampling_config
The client does not need to resend this batch of data

Exception cases:

If the pending cache is full, Dataway returns 503
In this case, the client must not treat the request as accepted
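From the client's point of view, the agreed behavior could look like the sketch below. The `post` parameter is a hypothetical callable that sends a request and returns the HTTP status code; it stands in for whatever HTTP client is in use. Treating any 2xx status as success is an assumption.

```python
from typing import Callable


def report_batch(post: Callable[[str, object], int], data: bytes, config: dict) -> bool:
    """Send one batch; return True if Dataway has accepted it."""
    status = post("/v1/tail_sampling", data)
    if status == 412:
        # The batch sits in the pending cache: send the config, do not resend data.
        post("/v1/tail_sampling_config", config)
        return True
    # Any other non-2xx status, including 503, means the batch was not accepted.
    return 200 <= status < 300
```

A caller that gets `False` still owns the batch and decides for itself whether to retry or drop it.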

Tail Sampling Built-in Metrics¶

Tail sampling configuration supports builtin_metrics. These metrics are generated by the tail sampling processor during the sampling process and written to the center during periodic refreshes.

Current built-in metrics are as follows.

tracing¶

trace_total_count
trace_kept_count
trace_dropped_count
trace_error_count
span_total_count
trace_duration

Where:

trace_duration is a duration distribution metric
Others are count metrics

logging¶

logging_total_count
logging_error_count
logging_kept_count
logging_dropped_count

rum¶

rum_total_count
rum_kept_count
rum_dropped_count

Explanation:

When builtin_metrics is empty, it currently defaults to enabling all built-in metrics supported by that data type
These metrics come from the tail sampling process itself, not Dataway's own operational metrics
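The default described above can be expressed as a small lookup. The metric names come from the lists on this page; the function name `enabled_metrics` is hypothetical, and the handling of a non-empty builtin_metrics list (only entries with enabled set to true are turned on) is an assumption, since the source only specifies the empty case.

```python
BUILTIN_METRICS = {
    "trace": [
        "trace_total_count", "trace_kept_count", "trace_dropped_count",
        "trace_error_count", "span_total_count", "trace_duration",
    ],
    "logging": [
        "logging_total_count", "logging_error_count",
        "logging_kept_count", "logging_dropped_count",
    ],
    "rum": ["rum_total_count", "rum_kept_count", "rum_dropped_count"],
}


def enabled_metrics(section: str, builtin_metrics: list) -> list:
    """Return the built-in metric names that would be active for a section."""
    supported = BUILTIN_METRICS[section]
    if not builtin_metrics:
        # Empty configuration enables every metric supported by the data type.
        return list(supported)
    # Assumption: a non-empty list enables only the entries marked enabled.
    wanted = {m.get("name") for m in builtin_metrics if m.get("enabled")}
    return [name for name in supported if name in wanted]
```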

Dataway Automatic Reporting Metrics¶

In addition to the sampler's own builtin_metrics, apis/metrics_special.go also automatically maintains a set of Dataway self-observation metrics to describe the processing status of the tail sampling API.

Metrics currently related to tail sampling include:

| Metric Name | Type | Tags | Description |
| --- | --- | --- | --- |
| `dataway_http_api_body_size_bytes_total` | Counter | `api`, `token` | Cumulative size, in bytes, of tail sampling interface request bodies |
| `dataway_http_tail_sampling_trace_total` | Counter | `token` | Number of received tracing groups |
| `dataway_http_tail_sampling_span_total` | Counter | `token` | Total number of received tracing spans |
| `dataway_http_tail_sampling_packet_send_total` | Counter | `token`, `data_type`, `result` | Send result statistics; `result` is one of `success`, `failure`, `drop` |

These metrics will:

Be collected every 1 minute
Be converted to dataway_aggregate metric points
Be reported to /v1/write/metric using the Dataway default token
Reset current accumulated values after reporting

This set of metrics reflects the operational status of Dataway itself processing tail sampling traffic, not the business statistics of the sampling rules themselves.
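The collect-then-reset cycle described above can be sketched with a lock-protected counter map. This illustrates the pattern only and is not the code in apis/metrics_special.go; the class name `ResettingCounters` and its methods are hypothetical.

```python
import threading
from collections import Counter


class ResettingCounters:
    """Accumulates counter values and clears them each time they are collected."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = Counter()

    def add(self, name: str, tags: tuple, value: int = 1) -> None:
        # Request handlers may call this concurrently, so guard the update.
        with self._lock:
            self._values[(name, tags)] += value

    def collect_and_reset(self) -> dict:
        """Snapshot the accumulated values, then start again from zero."""
        with self._lock:
            snapshot = dict(self._values)
            self._values.clear()
        return snapshot
```

A periodic task (every 1 minute in the current implementation) would call `collect_and_reset` and convert the snapshot into metric points, so each report covers only the interval since the previous one.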
