Pipelines¶
Pipeline is a lightweight scripting language that runs in DataKit and is used for custom parsing and modification of collected data. By defining parsing rules, it can slice and convert different types of data into structured formats to meet specific data management needs. For example, users can extract timestamps, statuses, and other key fields from logs via Pipeline and use them as labels.
DataKit leverages the powerful functionality of Pipeline to let users write and debug Pipeline scripts directly on the workspace page, achieving finer-grained structured processing of data. This not only improves data manageability but also, through the rich function library Pipeline provides, supports standardized operations on common data, such as parsing time strings and completing geolocation information for IP addresses.
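For instance, a minimal Pipeline sketch for the log case above might look like the following; the log layout (`<time> <status> <message>`) and the field names are assumptions for illustration only:

```
# Assumed log layout: "2023-01-01 12:00:00.000 ERROR something went wrong"
grok(_, "%{TIMESTAMP_ISO8601:time} %{NOTSPACE:status} %{GREEDYDATA:msg}")
default_time(time)   # use the extracted time as the data timestamp
```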
The main features of Pipeline include:
- As a lightweight scripting language, Pipeline provides efficient data processing capabilities;
- It has a rich function library supporting standardized operations on various common data types;
- Users can directly write and debug Pipeline scripts on the workspace page, making the creation and batch activation of scripts more convenient.
Currently, Guance supports configuring local Pipelines and central Pipelines.
- Local Pipeline: Runs during data collection, requiring DataKit collector version 1.5.0 or higher;
- Central Pipeline: Runs after data is uploaded to the console center.
Use Cases¶
| Type | Scenario |
| --- | --- |
| Local Pipeline | Process logs before data forwarding. |
| Central Pipeline | 1. User access (Session) data, Profiling data, and Synthetic Testing data; 2. Processing user access data in traces, such as extracting the `session`, `view`, and `resource` fields from the trace `message`. |
Data types not mentioned above can be processed by either local or central Pipelines.
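As an illustration of the central Pipeline scenario above, a minimal sketch might look like the following, assuming the trace `message` body is JSON with top-level `session`, `view`, and `resource` objects (this field layout is an assumption, not the actual data schema):

```
# Extract top-level keys from a JSON message body (assumed structure).
json(_, session)    # creates a `session` field
json(_, view)       # creates a `view` field
json(_, resource)   # creates a `resource` field
```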
Prerequisites¶
- Install DataKit;
- DataKit version must be >= 1.5.0.
To ensure normal use of Pipeline, please upgrade DataKit to version 1.5.0 or higher. A lower version may cause some Pipeline functions to fail.
In DataKit versions < 1.5.0:

- Default Pipeline functionality is not supported;
- Data sources do not support multiple selection; each Pipeline can only choose one `source`. Therefore, if your version is below 1.5.0 and you select multiple data sources, the Pipeline will not take effect;
- Pipeline names are fixed and cannot be modified. For example, if the log source is `nginx`, the Pipeline name is fixed as `nginx.p`. Therefore, if your version is below 1.5.0 and the Pipeline name does not match the data source name, the Pipeline will not take effect.
This feature requires a paid plan.
Create¶
In the workspace, go to Management > Pipelines and click Create Pipeline.
Alternatively, you can create one from the Pipelines entry in the menus for Metrics, Logs, User Analysis, APM, Infrastructure, and Security Check.
Note
After creating a Pipeline file, DataKit must be installed for it to take effect. DataKit periodically retrieves the configured Pipeline files from the workspace; the default interval is 1 minute and can be modified in `conf.d/datakit.conf`.
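For reference, a hedged sketch of that setting is shown below; the section and key name (`[pipeline]` / `remote_pull_interval`) are assumptions based on recent DataKit releases, so check the `datakit.conf` shipped with your version:

```
[pipeline]
  # Assumed key name: how often DataKit pulls workspace-configured Pipeline scripts.
  remote_pull_interval = "1m"
```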
- Select Pipeline type;
- Select data type and add filtering conditions;
- Input Pipeline name, i.e., the custom Pipeline filename;
- Provide test samples;
- Input function scripts and configure parsing rules;
- Click Save.
Note
- If the filtering object is Logs, the system automatically filters out TESTING data; even if the Pipeline is set as the default, it will not apply to TESTING data.
- If the filtering object is "Synthetic Testing", the type is automatically set to "Central Pipeline" and a local Pipeline cannot be chosen.
- Pipeline filenames must not be duplicated.
- Each data type supports only one default Pipeline. When creating or importing, if a duplicate exists, the system prompts a confirmation dialog asking whether to replace it. A Pipeline already set as the default displays the `default` identifier next to its name.
Test Samples¶
Based on the selected data type, input the corresponding data to test it against the configured parsing rules.
- One-click sample acquisition: Automatically retrieves already collected data, including Message and all fields;
- Add: You can add multiple sample data entries (up to 3).
Note
Pipeline files created in the workspace are uniformly saved under the `<datakit installation directory>/pipeline_remote` directory. Among them:

- Files under the first-level directory are default Log Pipelines.
- Each type of Pipeline file is saved in the corresponding second-level directory. For example, the Metric Pipeline file `cpu.p` is saved in the path `<datakit installation directory>/pipeline_remote/metric/cpu.p`.
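As an illustration of that layout (file names other than `cpu.p` are hypothetical):

```
<datakit installation directory>/pipeline_remote/
├── nginx.p          # first-level directory: default Log Pipelines
└── metric/
    └── cpu.p        # Metric Pipeline in its type's second-level directory
```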
For more details, refer to Pipeline Category Data Processing.
One-click Sample Acquisition¶
When creating or editing a Pipeline, click Sample Parsing Test > One-click Sample Acquisition. The system automatically selects the latest data, within the filtered data range, from the data already collected and reported to the workspace, and fills it into the test sample box for testing. Each click of One-click Sample Acquisition only queries data from the last 6 hours; if no data has been reported within the last 6 hours, no sample can be acquired automatically.
Debugging Example:
Below is a one-click acquired Metric data sample: the measurement is `cpu`, the labels are `cpu` and `host`, the fields from `usage_guest` to `usage_user` are metric data, and the final `1667732804738974000` is the timestamp. The returned result makes the data structure of the acquired sample clear.
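Rendered as line protocol, such a sample might look roughly like the following; the label values and the field names between `usage_guest` and `usage_user` are invented for illustration:

```
cpu,cpu=cpu-total,host=my-host usage_guest=0,usage_idle=95.6,usage_system=2.1,usage_user=2.3 1667732804738974000
```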
Manual Sample Input¶
You can also manually input sample data for testing, supporting two format types:
- Log data: directly input the `message` content in the sample parsing test;
- Other data types: first convert the content into line protocol format, then input it for the sample parsing test.
For more details about Log Pipelines, refer to Log Pipeline User Manual.
Line Protocol Example¶
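A sketch of such line protocol data is shown below (label and field values are illustrative); its elements are explained in the list that follows:

```
cpu,tag1=a,tag2=b f1=1i,f2=1.2,f3="abc" 162072387000000000
redis,tag1=a,tag2=b f1=1i,f2=1.2,f3="abc" 162072387000000000
```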
- `cpu` and `redis` are measurements;
- `tag1` and `tag2` are the label set;
- `f1`, `f2`, and `f3` are the field set (`f1=1i` indicates `int`, `f2=1.2` indicates the default `float`, `f3="abc"` indicates `string`);
- `162072387000000000` is the timestamp;
- The measurement and the label set are separated by a comma; multiple labels are separated by commas;
- The label set and the field set are separated by a space; multiple fields are separated by commas;
- The field set and the timestamp are separated by a space; the timestamp is mandatory;
- If the data is object data, it must have a `name` label, otherwise the protocol reports an error; a `message` field is recommended, mainly for convenient full-text search.
For more details about line protocols, refer to DataKit API.
To obtain more line protocol data, you can set `output_file` in `conf.d/datakit.conf` and view the line protocol in that file.
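A hedged sketch of that setting follows; the section placement (`[io]`) and the file path are assumptions, so check the `datakit.conf` of your DataKit version:

```
[io]
  # Assumed location of the option: write collected data to a local file as line protocol.
  output_file = "/tmp/lineproto.txt"
```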
Define Parsing Rules¶
By writing parsing rules manually or generating them with AI for different data sources, multiple script functions are supported. You can view their syntax formats directly in the script function list provided on the right side, such as `add_pattern()`.
For how to define parsing rules, refer to Pipeline Manual.
Manual Writing¶
Write data parsing rules yourself; text auto-wrap and content-overflow display settings can be adjusted.
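For example, a hand-written rule might look like the sketch below; the log layout and the `status_code` field name are assumptions for illustration, not an official template:

```
# Assumed log layout: "2024-12-25 07:25:33 [INFO] some message"
add_pattern("date", "%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}")
grok(_, "%{date:time} \\[%{NOTSPACE:status_code}\\] %{GREEDYDATA:msg}")
default_time(time)   # use the parsed time as the data timestamp
```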
AI Generation¶
AI generation uses a model to produce Pipeline parsing rules, aiming to quickly provide an initial parsing solution.
Note
Since the rules generated by the model may not cover all complex cases or scenarios, the returned results may not be entirely accurate. It is recommended to use them as a reference and starting point, and to further adjust and optimize them based on the specific log format and requirements.
Now, input the content to be extracted from the sample and the target field names, for example:

- `"date_pl":"2024-12-25 07:25:33.525"`
- `"m_pl":"[INFO][66] route_table.go 237: Queueing a resync of routing table. ipVersion=0x4"`
Click Generate Pipeline:
After testing, the returned result is shown in the console.
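As an illustration only (not the actual model output), a generated rule for the sample above might look roughly like:

```
grok(_, "%{TIMESTAMP_ISO8601:date_pl}%{SPACE}%{GREEDYDATA:m_pl}")
```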
For more details, refer to Rule Writing Guide.
Start Testing¶
On the Pipeline editing page, you can test the parsing rules you have filled in: simply input data in the Sample Parsing Test section. If the parsing rule does not match, an error prompt is returned. Sample parsing tests are optional; the test data is synchronized and saved after testing.
Terminal Command-line Debugging¶
In addition to debugging Pipelines in the console, you can also debug them from the terminal command line.
For more details, refer to How to Write Pipeline Scripts.
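The exact command-line form varies by DataKit version; the flags below are assumptions based on commonly documented usage, so check `datakit --help` on your host to confirm:

```
# Debug a Pipeline script against a sample text (flags are assumed; verify locally).
datakit pipeline nginx.p -T '127.0.0.1 - - [25/Dec/2024:07:25:33 +0000] "GET / HTTP/1.1" 200 612'
```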