Guance Data Sampling Technical Guide: Optimizing Data Volume and Query Efficiency¶
Functional Processing Logic¶
The data sampling feature of Guance spans the entire lifecycle of data processing:
- Sampling during the collection phase: Filters data before ingestion, reducing data intake and storage costs.
- Sampling during the query phase: Intelligently downsamples large-scale query results, improving chart rendering and data analysis speed.
Core Principles¶
The two phases of sampling are independent yet complementary: collection-phase sampling determines which data is permanently stored, directly impacting storage costs, while query-phase sampling affects only how data is presented, leaving the original data intact. Both aim to optimize resource consumption while preserving the integrity of key data.
Sampling Configuration in the Data Collection Phase¶
Here are three typical scenarios.
1. RUM Sampling¶
- Configuration entry: Guance Console > RUM > Create/Edit Application > SDK Configuration.
- Sampling logic:
    - Controls the data reporting ratio via the `sampleRate` parameter (e.g., `sampleRate: 90`).
    - Generates a random number (0-100) during SDK initialization; if the random number is less than the sampling rate, the data is reported.
- Use cases: Web/Android/iOS/Mini Program applications that need to reduce the storage cost of user behavior data.
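The reporting decision described above can be sketched in a few lines of Python (an illustrative simulation, not the actual Guance SDK implementation; `should_report` is a hypothetical helper):

```python
import random

def should_report(sample_rate: int, rng: random.Random) -> bool:
    """Draw a random number in [0, 100); report only if it falls
    below the configured sampling rate."""
    return rng.uniform(0, 100) < sample_rate

# With sampleRate: 90, roughly 90% of sessions are reported.
rng = random.Random(42)
reported = sum(should_report(90, rng) for _ in range(10_000))
```

Because the decision is made once at SDK initialization, all events from a given session are either reported together or dropped together, which keeps per-session behavior data coherent.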
2. APM Sampling¶
- Supported methods:
    - Set sampling rates through code instrumentation (e.g., DDTrace, OpenTelemetry);
    - Configure trace data sampling rules in the DataKit collector.
- Sampling strategy:
    - Full collection: error requests and slow requests;
    - Downsampling: normal requests (sampling rate can be set to 1%-10%).
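The keep-errors, keep-slow, sample-the-rest strategy can be sketched as a head-based sampling decision (a Python simulation; the function name and the 500 ms slow-request threshold are assumptions for illustration, not Guance defaults):

```python
import random

def keep_trace(is_error: bool, duration_ms: float,
               slow_threshold_ms: float = 500.0,
               normal_rate: float = 0.05,
               rng=random) -> bool:
    # Full collection: error and slow requests are always kept.
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    # Downsampling: normal requests are kept with probability
    # normal_rate (5% here, i.e. within the 1%-10% range above).
    return rng.random() < normal_rate
```

Making this decision once per trace (at the head) means every span of a kept trace is retained together, which is how trace integrity is preserved via the TraceID.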
3. Log Data Sampling¶
- Configuration capabilities:
    - Use the `drop()` function in a Pipeline to discard redundant logs, or the `sample()` function for proportional sampling;
    - Supports configuring blacklists to filter low-value logs.
- Typical scenario: apply a low sampling rate to debug logs and collect error logs in full.
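A minimal Pipeline sketch of this typical scenario (the conditional block and the field name `status` are assumptions about your log schema; check the Pipeline function reference for the exact `sample()`/`drop()` signatures):

```
# Sample debug logs at 5%; error logs fall through untouched
# and are collected in full.
if status == "debug" {
    sample(0.05)
}
```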
Sampling Function in the Query Phase¶
Guance provides an intelligent sampling mechanism during the query phase, which is automatically triggered when the query data volume reaches preset thresholds, ensuring the response performance of large-scale data queries.
Key sampling thresholds set by the system include:
- Dashboard queries: 200 million data points
- Explorer queries: 100 million data points
- Facet queries: 5 million data points
- Other general queries: 200 million data points
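The thresholds above can be expressed as a simple lookup that decides when query-phase sampling kicks in (illustrative Python; the dictionary keys and function name are assumptions, not part of the Guance API):

```python
# Preset thresholds (in data points) from the list above.
THRESHOLDS = {
    "dashboard": 200_000_000,
    "explorer": 100_000_000,
    "facet": 5_000_000,
    "general": 200_000_000,
}

def sampling_triggered(query_type: str, data_points: int) -> bool:
    # Query-phase sampling is triggered once the result set
    # reaches the preset threshold for that query type.
    return data_points >= THRESHOLDS.get(query_type, THRESHOLDS["general"])
```

Note the facet threshold is far lower than the others, so facet queries over wide time ranges are the most likely to be downsampled.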
Sampling Strategy Selection Guide¶
| Scenario | Sampling Type | Recommended Configuration | Notes |
|---|---|---|---|
| User Behavior Analysis | RUM Collection Sampling | Set sampling rate based on business importance | Ensure full collection of error operations and critical paths |
| Application Performance Troubleshooting | APM Trace Sampling | Full collection of error/slow requests, sampling rate of 1%-10% for normal requests | Ensure trace integrity through TraceID |
| Long-term Log Storage | Log Pipeline Sampling | Sampling rate ≤5% for high-frequency logs, full collection of error logs | Combine with sensitive data masking rules |
| Dashboard Macro Trends | Query Sampling | Enable sampling for time ranges > 24 hours | Disable sampling and narrow the time range when troubleshooting specific issues. |
| Real-time Alerts | No Sampling | Calculate based on raw data | Avoid false positives/negatives caused by sampling |
Key Considerations¶
- Data Consistency: Sampling may dilute extreme values; important decisions should be validated against full data.
- Cost-Performance Balance: Collection sampling reduces storage costs, while query sampling improves response speed.
- Dynamic Optimization: Regularly check the statistical distribution of sampled data and adjust strategies to adapt to business changes.
Summary¶
The sampling feature of Guance is a multi-level, customizable tool for optimizing cost and performance:
- Collection-side sampling directly reduces data inflow, suitable for high-frequency data such as RUM, APM, and logs.
- Query-side sampling keeps interaction with large-scale data smooth, suitable for dashboards and historical analysis.
By combining business priorities (e.g., full collection of errors, downsampling of normal data) with query needs (enable for macro trends, disable for precise troubleshooting), the optimal balance between resource efficiency and data reliability can be achieved.