Pipelines¶
Pipeline is a lightweight scripting language that runs in DataKit and is used for custom parsing and modification of collected data. By defining parsing rules, it can slice and convert different types of data into structured formats to meet specific data management needs. For example, users can extract timestamps, statuses, and other key fields from logs via Pipeline and use them as labels.
DataKit leverages the powerful functionality of Pipeline to let users write and debug Pipeline scripts directly on the workspace page, achieving finer-grained structured processing of data. This not only improves data manageability but also, through the rich function library Pipeline provides, supports standardized operations on common data, such as parsing time strings and completing geolocation information for IP addresses.
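For instance, a minimal Pipeline sketch for the log case above might look like the following; the log layout (`<time> <status> <message>`) and the field names are assumptions for illustration only:

```
# Assumed log layout: "2023-01-01 12:00:00.000 ERROR something went wrong"
grok(_, "%{TIMESTAMP_ISO8601:time} %{NOTSPACE:status} %{GREEDYDATA:msg}")
default_time(time)   # use the extracted time as the data timestamp
```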
The main features of Pipeline include:
- As a lightweight scripting language, Pipeline provides efficient data processing capabilities;
- It has a rich function library supporting standardized operations on various common data types;
- Users can directly write and debug Pipeline scripts on the workspace page, making the creation and batch activation of scripts more convenient.
Currently, Guance supports configuring local Pipelines and central Pipelines.
- Local Pipeline: Runs during data collection, requiring DataKit collector version 1.5.0 or higher;
- Central Pipeline: Runs after data is uploaded to the console center.
Use Cases¶
| Type | Scenario |
| --- | --- |
| Local Pipeline | Process logs before data forwarding. |
| Central Pipeline | 1. User access (Session) data, Profiling data, and Synthetic Testing data; 2. Processing user access data in traces, such as extracting the `session`, `view`, and `resource` fields from the trace `message`. |
Data types not mentioned above can be processed by either local or central Pipelines.
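As an illustration of the central Pipeline scenario above, a minimal sketch might look like the following, assuming the trace `message` body is JSON with top-level `session`, `view`, and `resource` objects (this field layout is an assumption, not the actual data schema):

```
# Extract top-level keys from a JSON message body (assumed structure).
json(_, session)    # creates a `session` field
json(_, view)       # creates a `view` field
json(_, resource)   # creates a `resource` field
```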
Prerequisites¶
- Install DataKit;
- DataKit version must be >= 1.5.0.
To ensure normal use of Pipeline, please upgrade DataKit to version 1.5.0 or higher. A lower version may cause some Pipeline functions to fail.
In DataKit versions < 1.5.0:

- Default Pipeline functionality is not supported;
- Data sources do not support multiple selection; each Pipeline can only choose one `source`. Therefore, if your version is below 1.5.0 and you select multiple data sources, the Pipeline will not take effect;
- Pipeline names are fixed and cannot be modified. For example, if the log source is `nginx`, the Pipeline name is fixed as `nginx.p`. Therefore, if your version is below 1.5.0 and the Pipeline name does not match the data source name, the Pipeline will not take effect.
This feature requires a paid plan.
Create¶
In the workspace, go to Management > Pipelines and click Create Pipeline.
Alternatively, you can create one from the Pipelines entry in the menus for Metrics, Logs, User Analysis, APM, Infrastructure, and Security Check.
Note
After creating a Pipeline file, DataKit must be installed for it to take effect. DataKit periodically retrieves the configured Pipeline files from the workspace; the default interval is 1 minute and can be modified in `conf.d/datakit.conf`.
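For reference, a hedged sketch of that setting is shown below; the section and key name (`[pipeline]` / `remote_pull_interval`) are assumptions based on recent DataKit releases, so check the `datakit.conf` shipped with your version:

```
[pipeline]
  # Assumed key name: how often DataKit pulls workspace-configured Pipeline scripts.
  remote_pull_interval = "1m"
```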
- Select Pipeline type;
- Select data type and add filtering conditions;
- Input Pipeline name, i.e., the custom Pipeline filename;
- Provide test samples;
- Input function scripts and configure parsing rules;
- Click Save.
Note
- If the filtering object is Logs, the system automatically filters out TESTING data; even if the Pipeline is set as the default, it will not apply to TESTING data.
- If the filtering object is "Synthetic Testing", the type is automatically set to "Central Pipeline" and a local Pipeline cannot be chosen.
- Pipeline filenames must not be duplicated.
- Each data type supports only one default Pipeline. When creating or importing, if a duplicate exists, the system prompts a confirmation dialog asking whether to replace it. A Pipeline already set as the default displays the `default` identifier next to its name.
Test Samples¶
Based on the selected data type, input the corresponding data to test it against the configured parsing rules.
- One-click sample acquisition: Automatically retrieves already collected data, including Message and all fields;
- Add: You can add multiple sample data entries (up to 3).
Note
Pipeline files created in the workspace are uniformly saved under the `<datakit installation directory>/pipeline_remote` directory. Among them:

- Files under the first-level directory are default Log Pipelines.
- Each type of Pipeline file is saved in the corresponding second-level directory. For example, the Metric Pipeline file `cpu.p` is saved in the path `<datakit installation directory>/pipeline_remote/metric/cpu.p`.
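As an illustration of that layout (file names other than `cpu.p` are hypothetical):

```
<datakit installation directory>/pipeline_remote/
├── nginx.p          # first-level directory: default Log Pipelines
└── metric/
    └── cpu.p        # Metric Pipeline in its type's second-level directory
```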
For more details, refer to Pipeline Category Data Processing.
One-click Sample Acquisition¶
When creating or editing a Pipeline, click Sample Parsing Test > One-click Sample Acquisition. The system automatically selects the latest data, within the filtered data range, from the data already collected and reported to the workspace, and fills it into the test sample box for testing. Each click of One-click Sample Acquisition only queries data from the last 6 hours; if no data has been reported within the last 6 hours, no sample can be acquired automatically.
Debugging Example:
Below is a one-click acquired Metric data sample: the measurement is `cpu`, the labels are `cpu` and `host`, the fields from `usage_guest` to `usage_user` are metric data, and the final `1667732804738974000` is the timestamp. The returned result makes the data structure of the acquired sample clear.
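Rendered as line protocol, such a sample might look roughly like the following; the label values and the field names between `usage_guest` and `usage_user` are invented for illustration:

```
cpu,cpu=cpu-total,host=my-host usage_guest=0,usage_idle=95.6,usage_system=2.1,usage_user=2.3 1667732804738974000
```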
Manual Sample Input¶
You can also manually input sample data for testing, supporting two format types:
- Log data: directly input the `message` content in the sample parsing test;
- Other data types: first convert the content into line protocol format, then input it for the sample parsing test.
For more details about Log Pipelines, refer to Log Pipeline User Manual.
Line Protocol Example¶
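A sketch of such line protocol data is shown below (label and field values are illustrative); its elements are explained in the list that follows:

```
cpu,tag1=a,tag2=b f1=1i,f2=1.2,f3="abc" 162072387000000000
redis,tag1=a,tag2=b f1=1i,f2=1.2,f3="abc" 162072387000000000
```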
- `cpu` and `redis` are measurements;
- `tag1` and `tag2` are the label set;
- `f1`, `f2`, and `f3` are the field set (`f1=1i` indicates `int`, `f2=1.2` indicates the default `float`, `f3="abc"` indicates `string`);
- `162072387000000000` is the timestamp;
- The measurement and the label set are separated by a comma; multiple labels are separated by commas;
- The label set and the field set are separated by a space; multiple fields are separated by commas;
- The field set and the timestamp are separated by a space; the timestamp is mandatory;
- If the data is object data, it must have a `name` label, otherwise the protocol reports an error; a `message` field is recommended, mainly for convenient full-text search.
For more details about line protocols, refer to DataKit API.
To obtain more line protocol data, you can set `output_file` in `conf.d/datakit.conf` and view the line protocol in that file.
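A hedged sketch of that setting follows; the section placement (`[io]`) and the file path are assumptions, so check the `datakit.conf` of your DataKit version:

```
[io]
  # Assumed location of the option: write collected data to a local file as line protocol.
  output_file = "/tmp/lineproto.txt"
```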
Define Parsing Rules¶
By writing parsing rules manually or generating them with AI for different data sources, multiple script functions are supported. You can view their syntax formats directly in the script function list provided on the right side, such as `add_pattern()`.
For how to define parsing rules, refer to Pipeline Manual.
Manual Writing¶
Write data parsing rules yourself; text auto-wrap and content-overflow display settings can be adjusted.
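For example, a hand-written rule might look like the sketch below; the log layout and the `status_code` field name are assumptions for illustration, not an official template:

```
# Assumed log layout: "2024-12-25 07:25:33 [INFO] some message"
add_pattern("date", "%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}")
grok(_, "%{date:time} \\[%{NOTSPACE:status_code}\\] %{GREEDYDATA:msg}")
default_time(time)   # use the parsed time as the data timestamp
```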
AI Generation¶
AI generation uses a model to produce Pipeline parsing rules, aiming to quickly provide an initial parsing solution.
Note
Since the rules generated by the model may not cover all complex cases or scenarios, the returned results may not be entirely accurate. It is recommended to use them as a reference and starting point, and to further adjust and optimize them based on the specific log format and requirements.
Now, input the content to be extracted from the sample and the target field names, for example:

- `"date_pl":"2024-12-25 07:25:33.525"`
- `"m_pl":"[INFO][66] route_table.go 237: Queueing a resync of routing table. ipVersion=0x4"`
Click Generate Pipeline:
After testing, the returned result is shown in the console.
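As an illustration only (not the actual model output), a generated rule for the sample above might look roughly like:

```
grok(_, "%{TIMESTAMP_ISO8601:date_pl}%{SPACE}%{GREEDYDATA:m_pl}")
```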
For more details, refer to Rule Writing Guide.
Start Testing¶
On the Pipeline editing page, you can test the parsing rules you have filled in: simply input data in the Sample Parsing Test section. If the parsing rule does not match, an error prompt is returned. Sample parsing tests are optional; the test data is synchronized and saved after testing.
Terminal Command-line Debugging¶
In addition to debugging Pipelines in the console, you can also debug them from the terminal command line.
For more details, refer to How to Write Pipeline Scripts.
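The exact command-line form varies by DataKit version; the flags below are assumptions based on commonly documented usage, so check `datakit --help` on your host to confirm:

```
# Debug a Pipeline script against a sample text (flags are assumed; verify locally).
datakit pipeline nginx.p -T '127.0.0.1 - - [25/Dec/2024:07:25:33 +0000] "GET / HTTP/1.1" 200 612'
```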