DataKit Log Processing Overview
This document introduces how DataKit processes logs. In another document, we explained how DataKit collects logs. These two documents should be read together to gain a comprehensive understanding of the entire log processing pipeline.
Key questions:
- Why is the configuration for log collection so complex?
- How is log data processed?
Why Is Log Collection Configuration So Complex?
As the log collection document explains, logs come from a wide variety of sources, so the configuration methods are correspondingly diverse. We clarify them here to aid understanding.
During log collection, DataKit uses two main types of collection methods:
- Active collection
    - Directly collecting logs from disk files
    - Collecting logs generated by containers
- Passive collection
    - Logs pushed to DataKit by the user, e.g., over TCP/UDP or via HTTP (Log Streaming)
Whichever form the collection takes, all of them must answer the same core question: how should DataKit process these logs next?
This core question can be broken down into the following sub-questions:
- Determining what the `source` is: All subsequent log processing depends on this field (there is an additional `service` field, but if not specified, its value will be set the same as the `source`)
- How to configure the Pipeline: Although not mandatory, it is widely used
- Additional tag configuration: Also not mandatory, but sometimes tags serve special functions
- How to split multi-line logs: DataKit must be told how the target text is separated into individual logs (by default, DataKit treats each line starting with a non-whitespace character as a new log); a sketch follows this list
- Whether there are specific ignore policies: Not all data collected by DataKit needs to be processed; logs meeting certain conditions can be left uncollected, even though they match the collection criteria
- Other special configurations: Such as filtering color characters, text-encoding handling, etc.
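As an illustration of multi-line splitting, here is a sketch using the `multiline_match` field from the logging collector's conf (the pattern and sample lines are made up):

```toml
# Lines matching this pattern start a new log; lines that do not match
# (e.g., stack-trace frames) are appended to the previous log. For logs like:
#
#   2023-01-01 10:00:00 ERROR something failed
#       at com.example.Foo.bar(Foo.java:42)   <- appended to the log above
#
# a pattern anchored on the leading date keeps the whole trace in one log:
multiline_match = '''^\d{4}-\d{2}-\d{2}'''
```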
Currently, there are several ways to inform DataKit how to process the collected logs:
- The conf configuration in the Log Collector
In the log collector's conf, you can specify the list of files to collect (or the TCP/UDP port to read log streams from), and configure settings such as the source, Pipeline, multi-line splitting, additional tags, and so on.
If data is sent to DataKit via TCP/UDP, subsequent log processing can only be configured through logging.conf, because the TCP/UDP protocols do not lend themselves to attaching additional descriptive information; they transmit only the raw log stream.
This form of log collection is the easiest to understand.
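For reference, here is a minimal logging.conf sketch (the field names follow the logging collector's conf, while the values are illustrative):

```toml
[[inputs.logging]]
  # List of files to collect (glob patterns allowed); path is illustrative
  logfiles = ["/var/log/my-app/*.log"]

  # The source field on which all subsequent log processing depends
  source = "my-app"

  # Optional: Pipeline script used to cut each log into structured fields
  pipeline = "my-app.p"

  # Multi-line splitting: a line matching this pattern starts a new log
  # (the default: any line starting with a non-whitespace character)
  multiline_match = '''^\S'''

  # Optional additional tags attached to every log from this input
  [inputs.logging.tags]
    app = "my-app"
```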
- The conf configuration in the Container Collector
Currently, the container collector's conf supports only very basic log configuration (based on container/Pod image names); subsequent log processing (such as Pipeline and source settings) cannot be configured there, because this conf applies to all logs on the current host, and in a container environment the logs on a single host are so varied that they cannot be categorized and configured individually at that level.
- Inform DataKit how to configure log processing through requests
By sending HTTP requests to DataKit's Log Streaming service, various request parameters can be attached to inform DataKit how to process the received log data.
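For example, a request along the following lines (a sketch: the endpoint path follows DataKit's Log Streaming service, while the parameter values and log payload are made up):

```shell
# Push one log line to DataKit's Log Streaming endpoint; query parameters
# such as source and pipeline tell DataKit how to process it (illustrative values)
curl -X POST \
  "http://localhost:9529/v1/write/logstreaming?source=my-testing-app&pipeline=test.p" \
  -d "2023-01-01 10:00:00 ERROR something failed"
```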
- Annotate the object being collected (e.g., containers/Pods) to inform DataKit how to process their logs
As mentioned earlier, configuring log collection solely in the container collector's conf is too coarse-grained for fine-grained control. Instead, annotations or labels can be added to containers/Pods, and DataKit actively discovers these annotations to determine how to process each container's or Pod's logs.
Priority Explanation
Under normal circumstances, annotations on containers/Pods have the highest priority and override settings in conf or Env; Env settings have medium priority and override conf configurations; configurations in conf have the lowest priority and can be overridden by Env or annotation settings at any time.
There are currently no direct Env variables related to log collection/processing, but relevant environment variables may be added in the future.
For example, in container.conf, assume we exclude the image named 'my_test' from log collection:
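```toml
# A sketch of the relevant container.conf entry (the glob form is illustrative)
container_exclude_log = ["image:my_test*"]
```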
In this case, DataKit will not collect logs from any container or Pod whose image matches this name. However, if the corresponding Pod carries specific annotations:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-app
  annotations:
    datakit/logs: |    # <----------
      [
        {
          "source": "my-testing-app",
          "pipeline": "test.p"
        }
      ]
spec:
  containers:
  - name: mytest
    image: my_test:1.2.3
```
Even though we excluded all images matching `my_test.*` in container.conf, DataKit will still collect logs from this Pod because it carries the dedicated annotation (`datakit/logs`), through which settings such as the Pipeline can also be configured.
How Log Data Is Processed
In DataKit, log data currently goes through the following stages of processing (listed in order):
- Collection Stage
After reading (or receiving) logs from external sources, the collection stage performs basic processing. This includes log segmentation (splitting large blocks of text into multiple independent raw logs), encoding conversion (normalizing everything to UTF-8), removal of interfering color characters, etc.
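For instance, removing color characters (ANSI escape sequences) turns a raw line like the first one below into the second (a made-up example):

```
\x1b[31mERROR\x1b[0m failed to connect    <- raw line with color codes
ERROR failed to connect                   <- after color characters are removed
```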
- Single Log Segmentation
If the log in question has Pipeline segmentation configured, each log (including multi-line single logs) will go through Pipeline segmentation, which mainly consists of two steps:
- Grok/JSON segmentation: Through Grok/JSON, a single raw log is segmented into structured data
- Fine-tuning the extracted fields: For example, completing IP information, log desensitization, etc.
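A minimal Pipeline sketch, assuming a script named `test.p` as in the annotation example above (the Grok pattern and field names are made up):

```python
# test.p -- an illustrative Pipeline script
# Step 1: Grok segmentation: cut the raw log into structured fields
grok(_, "%{TIMESTAMP_ISO8601:time} %{LOGLEVEL:status} %{GREEDYDATA:msg}")

# Step 2: fine-tune the extracted fields: use the extracted time as the log timestamp
default_time(time)
```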
- Blacklist (Filter)
Filters receive structured data and decide, based on configured conditions, whether to discard it. They are issued centrally (pulled by DataKit) and follow a format similar to the sketch below (the condition values are illustrative):
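```json
{
    "filters": {
        "logging": [
            "{ source = 'datakit' and ( bar in ['abc', '123'] ) }"
        ]
    }
}
```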
If the center configures such a log blacklist, then assuming 10 out of 100 logs meet the condition (i.e., the `source` is `datakit` and the value of the `bar` field appears in the list), those 10 logs will not be reported to Guance and are silently discarded. Statistics on discarded logs can be viewed in the DataKit Monitor.
- Reporting to Guance
After these steps, the log data is finally reported to Guance, where it can be viewed on the log viewing page.
Under normal circumstances, assuming collection succeeds, there is a delay of roughly 30 seconds between log generation and the data appearing on the page. During this period, DataKit itself reports data at intervals of at most 10 seconds, and the center performs a series of processing steps before final storage.