
DataKit Log Processing Overview

This document introduces how DataKit processes logs. A separate document explains how DataKit collects logs; the two should be read together for a complete picture of the log processing pipeline.

Key questions:

  • Why is the configuration for log collection so complex?
  • How is log data processed?

Why is Log Collection Configuration So Complex?

As the collection document shows, logs come from a variety of sources, so the ways to configure them are correspondingly diverse. We sort them out here to aid understanding.

During log collection, DataKit uses two main types of collection methods: actively collecting logs (e.g., reading log files on disk or container stdout/stderr), and passively receiving logs pushed to it (e.g., over TCP/UDP or HTTP).

Whatever the form of collection, they all face the same core question: how should DataKit process the logs it has collected?

This core question divides into the following sub-questions:

  • Determining the source: all subsequent log processing depends on this field (there is also a service field; if unspecified, it defaults to the value of source)
  • How to configure the Pipeline: not mandatory, but widely used
  • Additional tag configuration: also not mandatory, but sometimes serves special functions
  • How to split multi-line logs: DataKit must be told where one log ends and the next begins (by default, each line starting with a non-whitespace character begins a new log)
  • Whether specific ignore policies apply: not everything DataKit collects needs processing; logs matching certain conditions can be dropped even though they meet the collection criteria
  • Other special configurations: filtering ANSI color characters, text encoding handling, and so on

Currently, there are several ways to inform DataKit how to process the collected logs:

  • Specify log processing in the collector's conf

In the log collector, the conf specifies the list of files to collect (or the TCP/UDP port to read log streams from). The same conf also configures settings such as the source, the Pipeline, multi-line splitting, and additional tags.

If data is sent to DataKit over TCP/UDP, subsequent log processing can only be configured in logging.conf, because TCP/UDP carry only the raw log stream and offer no convenient way to attach additional descriptive information.

This form of log collection is the easiest to understand.
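
As a sketch, a logging.conf covering the settings above might look like this (field names follow the logging collector's sample configuration, but all paths, ports, and values here are placeholders; verify them against your DataKit version):

[[inputs.logging]]
  # Files to collect; glob patterns are supported
  logfiles = ["/var/log/myapp/*.log"]
  # Optionally read raw log streams from a TCP/UDP port instead
  sockets = ["tcp://0.0.0.0:9540"]
  # The source on which all subsequent processing depends
  source = "myapp"
  # Optional Pipeline script used to split each log
  pipeline = "myapp.p"
  # A new log starts at each line matching this pattern;
  # non-matching lines are appended to the previous log
  multiline_match = '''^\d{4}-\d{2}-\d{2}'''
  # Source encoding; DataKit converts everything to UTF-8
  character_encoding = "utf-8"
  # Strip interfering ANSI color characters
  remove_ansi_escape_codes = true
  # Additional tags attached to every log from this input
  [inputs.logging.tags]
    app = "myapp"

With the sockets line enabled, a raw line pushed over TCP (for example, echo 'hello' | nc localhost 9540) is processed according to this same conf, since TCP itself carries no metadata.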

The container collector's conf, by contrast, supports only very basic log configuration (filtering by container/Pod image name) and cannot configure subsequent processing such as the Pipeline or source. This conf applies to every log on the current host, and in a container environment the logs on a single host are so varied that they cannot be categorized and configured individually here.

  • Inform DataKit how to configure log processing through requests

Through HTTP requests to DataKit's Log Streaming service, various request parameters can be included to tell DataKit how to process the received log data.
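
For instance, a minimal sketch of such a request (the /v1/write/logstreaming endpoint and the source/pipeline/tags query parameters follow DataKit's Log Streaming documentation; the host, the default port 9529, and all values here are placeholders):

# Push a raw log line to DataKit's Log Streaming endpoint; the query
# parameters tell DataKit how to process it (no conf entry required).
curl -X POST \
  "http://localhost:9529/v1/write/logstreaming?source=my-testing-app&pipeline=test.p&tags=app=test" \
  -d "2024-01-01T12:00:00 ERROR something went wrong"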

  • Annotate the object being collected (e.g., containers/Pods) to inform DataKit how to process their logs

As mentioned earlier, the container collector's conf is too coarse-grained for per-container settings. Instead, annotations or labels can be added to containers/Pods, and DataKit actively discovers them to determine how to process each container's or Pod's logs.

Priority Explanation

Under normal circumstances, annotations on containers/Pods have the highest priority and override conf and Env settings; Env settings have medium priority and override conf; conf has the lowest priority and can be overridden by either at any time.

There are currently no direct Env variables related to log collection/processing, but relevant environment variables may be added in the future.

For example, in container.conf, assume we exclude containers whose image name matches my_test* from log collection:

container_exclude_log = ["image:my_test*"]

In this case, DataKit will not collect logs from any container or Pod whose image matches this name. However, if the corresponding Pod carries specific annotations:

apiVersion: v1
kind: Pod
metadata:
  name: test-app
  annotations:
    datakit/logs: |   # <----------
      [
        {
          "source": "my-testing-app",
          "pipeline": "test.p"
        }
      ]
spec:
  containers:
  - name: mytest
    image: my_test:1.2.3
Even though container.conf excludes all images matching my_test*, DataKit will still collect logs from this Pod because it carries the datakit/logs annotation, which can also configure settings such as the Pipeline.

How Log Data Is Processed

In DataKit, log data currently goes through the following stages of processing (listed in order):

  • Collection Stage

After reading (or receiving) logs from external sources, the collection stage performs basic processing, including log segmentation (splitting a large block of text into multiple independent raw logs), encoding conversion (everything is converted to UTF-8), removal of interfering color characters, and so on.
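
As an illustration of the default splitting rule (a new log begins at each line that starts with a non-whitespace character), a stack trace is merged into the preceding log because its continuation lines are indented (log content invented for illustration):

2024-01-01 12:00:00 ERROR something went wrong     <- starts a new log
    at com.example.Foo.bar(Foo.java:42)            <- indented: appended to the log above
    at com.example.Main.main(Main.java:10)         <- indented: appended to the log above
2024-01-01 12:00:01 INFO recovered                 <- starts the next log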

  • Single Log Segmentation

If a Pipeline is configured for the log in question, each individual log (including a multi-line log merged into one) goes through Pipeline processing, which consists of two main steps:

  1. Grok/JSON splitting: through Grok or JSON, a single raw log is cut into structured data.
  2. Fine-tuning the extracted fields: for example, completing IP information or desensitizing sensitive fields (see the sketch below).
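
A minimal sketch illustrating both steps (grok, default_time, geoip, and cover are functions in DataKit's Pipeline function set, but this test.p script and its field names are illustrative, not a prescribed recipe):

# test.p -- illustrative Pipeline script; adapt the pattern to your log format
# Step 1: Grok splitting -- cut the raw log into structured fields
grok(_, "%{TIMESTAMP_ISO8601:time} %{LOGLEVEL:status} %{IPORHOST:client_ip} %{GREEDYDATA:msg}")

# Step 2: fine-tune the extracted fields
default_time(time)         # take the log's timestamp from the extracted field
geoip(client_ip)           # complete IP information (city/province/country, etc.)
cover(client_ip, [1, 7])   # desensitization: mask part of the client IP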

  • Blacklist (Filter)

Filters receive structured data and decide, based on certain conditions, whether to discard it. The filter set is issued centrally (pulled by DataKit) and takes a form similar to:

{ source = 'datakit' AND bar IN [ 1, 2, 3] }

If the center configures a log blacklist and, say, 10 out of 100 logs meet its condition (source is datakit and the value of the bar field appears in the list), those 10 logs will not be reported to Guance and are silently discarded. Statistics on discarded logs can be seen in the DataKit Monitor.

  • Reporting to Guance

After these steps, the log data is finally reported to Guance, where it can be viewed on the log viewing page.

Under normal circumstances, if collection succeeds, there is a delay of roughly 30 seconds between a log being produced and its data appearing on the page. During this period, DataKit itself reports data at intervals of at most 10 seconds, and the center performs a series of processing steps before final storage.

