Guance Data Sampling Technical Guide: Optimizing Data Volume and Query Efficiency¶
Functional Processing Logic¶
The data sampling feature of Guance spans the entire lifecycle of data processing:
- Sampling during the collection phase: Filters data before ingestion, reducing data intake and storage costs.
- Sampling during the query phase: Intelligently downsamples large-scale query results, improving chart rendering and data analysis speed.
Core Principles¶
The two phases of sampling are independent yet complementary: collection-phase sampling determines which data is permanently stored, directly impacting storage costs, while query-phase sampling affects only how data is presented, leaving the original data intact. Both aim to optimize resource consumption while preserving the integrity of key data.
Sampling Configuration in the Data Collection Phase¶
Here are three typical scenarios.
1. RUM Sampling¶
- Configuration entry: Guance Console > RUM > Create/Edit Application > SDK Configuration.
- Sampling logic:
    - Controls the data reporting ratio via the `sampleRate` parameter (e.g., `sampleRate: 90`).
    - Generates a random number (0-100) during SDK initialization; if the random number is less than the sampling rate, the data is reported.
- Use cases: Web/Android/iOS/Mini Program applications that need to reduce the storage cost of user behavior data.
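The reporting decision described above can be sketched in a few lines of Python (an illustrative simulation, not the actual Guance SDK implementation; `should_report` is a hypothetical helper):

```python
import random

def should_report(sample_rate: int, rng: random.Random) -> bool:
    """Draw a random number in [0, 100); report only if it falls
    below the configured sampling rate."""
    return rng.uniform(0, 100) < sample_rate

# With sampleRate: 90, roughly 90% of sessions are reported.
rng = random.Random(42)
reported = sum(should_report(90, rng) for _ in range(10_000))
```

Because the decision is made once at SDK initialization, all events from a given session are either reported together or dropped together, which keeps per-session behavior data coherent.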
2. APM Sampling¶
- Supported methods:
    - Set sampling rates through code instrumentation (e.g., DDTrace, OpenTelemetry);
    - Configure trace data sampling rules in the DataKit collector.
- Sampling strategy:
    - Full collection: error requests and slow requests;
    - Downsampling: normal requests (sampling rate can be set to 1%-10%).
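The keep-errors, keep-slow, sample-the-rest strategy can be sketched as a head-based sampling decision (a Python simulation; the function name and the 500 ms slow-request threshold are assumptions for illustration, not Guance defaults):

```python
import random

def keep_trace(is_error: bool, duration_ms: float,
               slow_threshold_ms: float = 500.0,
               normal_rate: float = 0.05,
               rng=random) -> bool:
    # Full collection: error and slow requests are always kept.
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    # Downsampling: normal requests are kept with probability
    # normal_rate (5% here, i.e. within the 1%-10% range above).
    return rng.random() < normal_rate
```

Making this decision once per trace (at the head) means every span of a kept trace is retained together, which is how trace integrity is preserved via the TraceID.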
3. Log Data Sampling¶
- Configuration capabilities:
    - Use the `drop()` function in a Pipeline to discard redundant logs, or the `sample()` function for proportional sampling;
    - Supports configuring blacklists to filter low-value logs.
- Typical scenario: apply a low sampling rate to debug logs and collect error logs in full.
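A minimal Pipeline sketch of this typical scenario (the conditional block and the field name `status` are assumptions about your log schema; check the Pipeline function reference for the exact `sample()`/`drop()` signatures):

```
# Sample debug logs at 5%; error logs fall through untouched
# and are collected in full.
if status == "debug" {
    sample(0.05)
}
```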
Sampling Function in the Query Phase¶
Guance provides an intelligent sampling mechanism during the query phase, which is automatically triggered when the query data volume reaches preset thresholds, ensuring the response performance of large-scale data queries.
Key sampling thresholds set by the system include:
- Dashboard queries: 200 million data points
- Explorer queries: 100 million data points
- Facet queries: 5 million data points
- Other general queries: 200 million data points
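The thresholds above can be expressed as a simple lookup that decides when query-phase sampling kicks in (illustrative Python; the dictionary keys and function name are assumptions, not part of the Guance API):

```python
# Preset thresholds (in data points) from the list above.
THRESHOLDS = {
    "dashboard": 200_000_000,
    "explorer": 100_000_000,
    "facet": 5_000_000,
    "general": 200_000_000,
}

def sampling_triggered(query_type: str, data_points: int) -> bool:
    # Query-phase sampling is triggered once the result set
    # reaches the preset threshold for that query type.
    return data_points >= THRESHOLDS.get(query_type, THRESHOLDS["general"])
```

Note the facet threshold is far lower than the others, so facet queries over wide time ranges are the most likely to be downsampled.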
Sampling Strategy Selection Guide¶
| Scenario | Sampling Type | Recommended Configuration | Notes |
|---|---|---|---|
| User Behavior Analysis | RUM Collection Sampling | Set sampling rate based on business importance | Ensure full collection of error operations and critical paths |
| Application Performance Troubleshooting | APM Trace Sampling | Full collection of error/slow requests, sampling rate of 1%-10% for normal requests | Ensure trace integrity through TraceID |
| Long-term Log Storage | Log Pipeline Sampling | Sampling rate ≤5% for high-frequency logs, full collection of error logs | Combine with sensitive data masking rules |
| Dashboard Macro Trends | Query Sampling | Enable sampling for time ranges > 24 hours | Disable sampling and narrow the time range when troubleshooting specific issues. |
| Real-time Alerts | No Sampling | Calculate based on raw data | Avoid false positives/negatives caused by sampling |
Key Considerations¶
- Data Consistency: Sampling may dilute extreme values; important decisions should be validated against full data.
- Cost-Performance Balance: Collection sampling reduces storage costs, while query sampling improves response speed.
- Dynamic Optimization: Regularly check the statistical distribution of sampled data and adjust strategies to adapt to business changes.
Summary¶
The sampling feature of Guance is a multi-level, customizable tool for optimizing cost and performance:
- Collection-side sampling directly reduces data inflow, suitable for high-frequency data such as RUM, APM, and logs.
- Query-side sampling keeps interaction with large-scale data smooth, suitable for dashboards and historical analysis.
By combining business priorities (e.g., full collection of errors, downsampling of normal data) with query needs (enable for macro trends, disable for precise troubleshooting), the optimal balance between resource efficiency and data reliability can be achieved.