Pipelines


Pipelines is a lightweight scripting language that runs on DataKit, used for custom parsing and modification of collected data. By defining parsing rules, it can finely slice and convert different types of data into structured formats to meet specific data management needs. For example, users can extract timestamps, statuses, and other key fields from logs via Pipelines and use this information as labels.

DataKit leverages the powerful functionality of Pipelines, allowing users to directly write and debug Pipeline scripts on the workspace page, thereby achieving finer-grained structured processing of data. This not only improves the manageability of data but also, through the rich function library provided by Pipeline, supports standardized operations on common data, such as parsing time strings and completing geographical information for IP addresses.

The main features of Pipeline include:

  • As a lightweight scripting language, Pipeline provides efficient data processing capabilities;
  • It has a rich function library supporting standardized operations on various common data types;
  • Users can directly write and debug Pipeline scripts on the workspace page, making the creation and batch activation of scripts more convenient.

Currently, Guance supports configuring local Pipelines and central Pipelines.

  • Local Pipeline: Runs during data collection, requiring DataKit collector version 1.5.0 or higher;
  • Central Pipeline: Runs after the data has been uploaded to the console center.

Use Cases

Type                Scenario
Local Pipeline      Process logs before data forwarding.
Central Pipeline    1. User access (Session) data, Profiling data, and Synthetic Testing data;
                    2. User access data in traces, such as extracting session, view, and resource fields from the trace's message.

Data types not listed above can be processed by either local or central Pipelines.

Prerequisites

To ensure normal use of Pipelines, please upgrade DataKit to version 1.5.0 or higher. Lower versions may cause some Pipeline features to fail.

In DataKit versions below 1.5.0:

  • The default Pipeline feature is not supported;

  • Data sources do not support multiple selection; each Pipeline can only have one source. Therefore, if your version is below 1.5.0 and you select multiple data sources, the Pipeline will not take effect;

  • Pipeline names are fixed and cannot be modified. For example, if the log source is nginx, then the Pipeline name must be nginx.p. Therefore, if your version is below 1.5.0 and the Pipeline name does not match the data source name, the Pipeline will not take effect.


This is a paid feature.


Create

In the workspace, go to Management > Pipelines and click Create Pipeline.

Alternatively, you can create one by clicking Pipelines in the menu entries for Metrics, Logs, User Analysis, APM, Infrastructure, and Security Check.

Note

After creating a Pipeline file, DataKit must be installed for it to take effect. DataKit periodically pulls the configured Pipeline files from the workspace; the default interval is 1 minute and can be modified in conf.d/datakit.conf:

[pipeline]
  remote_pull_interval = "1m"

To create a Pipeline:

  1. Select the Pipeline type;
  2. Select the data type and add filtering conditions;
  3. Enter the Pipeline name, i.e., the custom Pipeline filename;
  4. Provide test samples;
  5. Enter function scripts and configure parsing rules;
  6. Click Save.
Note
  • If the filtering object is Logs, the system automatically filters out TESTING data; even if the Pipeline is set as default, it will not apply to TESTING data.
  • If the filtering object is "Synthetic Testing", the type is automatically set to "Central Pipeline" and a local Pipeline cannot be selected.
  • Pipeline filenames must not duplicate one another.
  • Each data type supports only one default Pipeline. When creating or importing a duplicate, the system displays a confirmation dialog asking whether to replace it. A Pipeline already set as default shows a default badge next to its name.

Test Samples

Based on the selected data type, input corresponding sample data and test it against the configured parsing rules.

  1. One-click sample acquisition: Automatically retrieves already collected data, including Message and all fields;
  2. Add: You can add multiple sample data entries (up to 3).
Note

Pipeline files created in the workspace are uniformly saved under the <datakit installation directory>/pipeline_remote directory. Among them:

  • Files under the first-level directory are default Log Pipelines.
  • Each type of Pipeline file is saved in the corresponding second-level directory. For example, the Metric Pipeline file cpu.p is saved in the path <datakit installation directory>/pipeline_remote/metric/cpu.p.

For more details, refer to Pipeline Category Data Processing.

One-click Sample Acquisition

When creating or editing a Pipeline, click Sample Parsing Test > One-click Sample Acquisition. The system automatically selects the latest data that has been collected and reported to the workspace within the filtered data range and fills it into the test sample box. Each click of One-click Sample Acquisition queries only data from the last 6 hours; if no data has been reported in that window, a sample cannot be acquired automatically.

Debugging Example:

Below is a Metric data sample acquired with one click. The Measurement is cpu, the labels are cpu and host, the fields from usage_guest to usage_user are all metric data, and the final 1667732804738974000 is the timestamp. The returned result makes the structure of the one-click acquired sample clear.
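
Written as line protocol, a sample of that shape looks roughly like the following; the tag and field values below are illustrative placeholders (and only a few of the usage_* fields are shown), not the actual data returned by one-click acquisition:

  cpu,cpu=cpu-total,host=my-host usage_guest=0,usage_idle=93.5,usage_system=2.4,usage_user=3.1 1667732804738974000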

Manual Sample Input

You can also manually input sample data for testing, supporting two format types:

  • Log data can directly input message content for testing in the sample parsing test;
  • Other data types should first convert the content into "line protocol" format, then input it for sample parsing testing.

For more details about Log Pipelines, refer to Log Pipeline User Manual.

Line Protocol Example
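
A minimal line protocol sample matching the description below might look like this (the tag values a and b are arbitrary placeholders):

  cpu,tag1=a,tag2=b f1=1i,f2=1.2,f3="abc" 162072387000000000
  redis,tag1=a,tag2=b f1=1i,f2=1.2,f3="abc" 162072387000000000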

  • cpu and redis are Measurements; tag1 and tag2 form the label set; f1, f2, and f3 form the field set (f1=1i indicates an int, f2=1.2 indicates the default float, f3="abc" indicates a string); 162072387000000000 is the timestamp;
  • The Measurement and the label set are separated by a comma; multiple labels are separated by commas;
  • The label set and the field set are separated by a space; multiple fields are separated by commas;
  • The field set and the timestamp are separated by a space; the timestamp is mandatory;
  • If the data is object data, it must have a name label, otherwise the protocol reports an error; a message field is also recommended, mainly for convenience in full-text search.

For more details about line protocols, refer to DataKit API.

You can also obtain line protocol data by setting output_file in conf.d/datakit.conf and viewing the line protocol written to that file:

[io]
  output_file = "/path/to/file"

Define Parsing Rules

Parsing rules for different data sources can be written manually or generated by AI, and multiple script functions are supported. You can view their syntax directly in the script function list provided on the right side, such as add_pattern().

For how to define parsing rules, refer to Pipeline Manual.

Manual Writing

Write the data parsing rules yourself; text auto-wrap and content overflow display settings can be adjusted.
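
For instance, a minimal hand-written script might combine add_pattern(), grok(), and default_time(); the log format and field names below are assumptions for illustration only:

  # A minimal sketch, assuming log lines such as:
  # 2024-12-25 07:25:33 INFO some message text
  add_pattern("log_date", "%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}")
  grok(_, "%{log_date:time} %{LOGLEVEL:status} %{GREEDYDATA:msg}")
  default_time(time)  # use the extracted time as the data timestamp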

AI Generation

AI generation uses a model to produce a Pipeline parsing script, aiming to quickly provide an initial parsing solution.

Note

Since the rules generated by the model may not cover all complex cases or scenarios, the returned results may not be entirely accurate. It is recommended to use them as a reference and starting point, and to further adjust and optimize them for your specific log formats and requirements.

Then, based on the sample, specify the content to extract and the names of the fields to assign, for example:

-"date_pl":"2024-12-25 07:25:33.525",
-"m_pl":"[INFO][66] route_table.go 237: Queueing a resync of routing table. ipVersion=0x4"

Click Generate Pipeline:
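
For comparison, a hand-written rule extracting the same two fields could look roughly like this; it is an illustrative sketch assuming the sample message begins with the timestamp, not the exact script the model returns:

  # Illustrative only; the generated script may differ.
  grok(_, "%{TIMESTAMP_ISO8601:date_pl} %{GREEDYDATA:m_pl}")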

After testing, the returned result looks like this:

For more details, refer to Rule Writing Guide.

Start Testing

On the Pipeline editing page, you can test the parsing rules you have written by entering data in the Sample Parsing Test section. If a parsing rule does not match, an error message is returned. Sample parsing tests are optional; the test data is saved together with the Pipeline after testing.

Terminal Command-line Debugging

In addition to debugging Pipelines on the console, you can also debug Pipelines via terminal command lines.

For more details, refer to How to Write Pipeline Scripts.
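
As a rough example, recent DataKit versions provide a pipeline subcommand for local debugging; the exact subcommand and flags vary by version, so treat the following as an assumed invocation and confirm it against your DataKit CLI help first:

  # Test a Pipeline script against a sample text (flags may differ by DataKit version)
  datakit pipeline -P nginx.p -T 'sample log text to test'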

More Reading
