Overview of DataKit Log Processing
This article introduces how DataKit processes logs. Another document describes how DataKit collects logs; the two are best read together for a more complete picture of the whole log processing flow.
Core issues:
- Why the configuration of log collection is so complicated
- How log data is processed
Why Is the Configuration of Log Collection So Complicated?
As this document shows, logs come from many different sources, so there are many different ways to configure log collection. It is therefore worth sorting them out here.
In the process of log collection, DataKit has two types of collection methods, active and passive:

- Active collection
  - Collecting disk file logs directly
  - Collecting logs generated by containers
- Passive collection (e.g. logs pushed to DataKit over TCP/UDP or via HTTP logstreaming, as described below)

Whatever form the collection takes, DataKit has to solve the same core problem: what should it do with these logs next?
This core issue can be subdivided into the following sub-issues:
- Determine what the source is: all subsequent log processing depends on this field. (There is also a service field; if it is not specified, its value is set to the same as the source value.)
- How to configure a Pipeline: not required, but widely used.
- Extra tag configuration: not required, but sometimes it has its own special uses.
- How to split multi-line logs: DataKit needs to be told how the target log delimits each entry. (By default, every line beginning with a non-whitespace character starts a new log; see the sketch after this list.)
- Whether there is a special ignore policy: not all data collected by DataKit needs to be processed; if certain conditions are met, you can choose not to collect it (even though it meets the collection conditions).
- Other features: such as filtering out color (ANSI escape) characters, handling text encoding, etc.
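To make the multi-line point concrete, consider a hypothetical application log containing a Java stack trace:

```
2024-01-02 15:04:05 ERROR failed to connect to db
java.sql.SQLException: connection refused
    at com.example.Dao.query(Dao.java:42)
    at com.example.Service.run(Service.java:17)
2024-01-02 15:04:06 INFO retrying...
```

With the default rule, the indented "at ..." lines are appended to the previous log, but the "java.sql.SQLException: ..." line starts with a non-whitespace character and would be treated as a new log. To keep the whole stack trace in one entry, a multi-line rule matching the leading timestamp (e.g. `^\d{4}-\d{2}-\d{2}`) is typically configured instead.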
At present, there are several ways to tell DataKit how to handle the logs it obtains:
- Conf configuration in the log collector
In the log collector, the list of files to collect (or the TCP/UDP port to read a log stream from) is configured through conf, and settings such as source, Pipeline, multi-line splitting, and additional tags can all be configured there as well.
If the data is sent to DataKit over TCP/UDP, subsequent log processing can only be configured through logging.conf, because TCP/UDP is not suited to carrying extra descriptive information; it only transports the raw log stream.
This form of log collection is the easiest to understand.
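As a minimal sketch (assuming the logging collector's conf format in recent DataKit versions; field names and defaults may differ between releases), such a conf ties together the choices listed above, namely source, Pipeline, multi-line rule, and extra tags:

```toml
[[inputs.logging]]
  # Files (or a TCP/UDP listener) to collect logs from
  logfiles = ["/var/log/my-app/*.log"]

  source   = "my-app"      # all subsequent processing keys off this field
  service  = "my-app"      # falls back to the source value if omitted
  pipeline = "my-app.p"    # optional Pipeline script for cutting each log

  # A new log entry starts at every line matching this regex (multi-line splitting)
  multiline_match = '''^\d{4}-\d{2}-\d{2}'''

  # Extra tags attached to every collected log
  [inputs.logging.tags]
    project = "demo"
```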
- Conf configuration in the container collector
At present, the container collector's conf can only apply the most superficial log configuration (based on the container/Pod image name); it cannot configure the subsequent processing of logs (such as Pipeline or source settings) here. This is because the conf targets the collection of all logs on the current host, and in a container environment a single host carries many different logs, so they cannot be configured one by one, category by category, in conf.
- Telling DataKit in the request how to process the logs
DataKit's logstreaming service is requested over HTTP, and the request parameters tell DataKit how to process the received log data.
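For example (the endpoint below is DataKit's logstreaming endpoint on its default HTTP port; check the logstreaming documentation for the exact set of supported query parameters), the sender can attach the source and Pipeline directly to the request:

```shell
# Push a log line to DataKit; source and pipeline are carried as query parameters,
# so no per-source conf entry is needed on the DataKit side.
curl -X POST \
  "http://localhost:9529/v1/write/logstreaming?source=my-app&pipeline=my-app.p" \
  --data-binary 'level=error msg="connection refused" ts=2024-01-02T15:04:05Z'
```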
- Attaching specific Annotations to the collected objects (such as containers/Pods) to tell DataKit how to handle the logs they generate
As mentioned earlier, configuring log collection only in the container collector's conf is too coarse-grained for fine-tuned configuration. However, Annotations/Labels can be attached to containers/Pods; DataKit actively discovers them and thereby knows how to handle each container's or Pod's logs.
Priority Description
In general, the Annotation on the container/Pod has the highest priority and overrides settings in conf or Env; Env has the middle priority and overrides the configuration in conf; configuration in conf has the lowest priority and may be overridden at any time by settings in Env or in Annotations.
At present, there is no Env directly related to log collection/processing; related environment variables may be added later.
As an example, suppose that in container.conf we exclude an image named my_test from log collection:
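A sketch of such an exclusion (assuming the container collector's container_exclude_log filter; the exact wildcard syntax may vary between versions) could look like this:

```toml
[inputs.container]
  # Do not collect logs from containers/Pods whose image name matches the wildcard
  container_exclude_log = ["image:my_test*"]
```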
With this setting, DataKit does not collect logs from any container or Pod whose image matches that name. However, if the corresponding Annotation is attached to the Pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-app
  annotations:
    datakit/logs: |    # <----------
      [
        {
          "source": "my-testing-app",
          "pipeline": "test.p"
        }
      ]
spec:
  containers:
  - name: mytest
    image: my_test:1.2.3
```
then even though we excluded all images matching the wildcard my_test.* in container.conf, because the Pod carries the specific Annotation (datakit/logs), DataKit still collects this Pod's logs, and settings such as the Pipeline can be configured through that Annotation.
How Log Data is Processed
In DataKit, logs currently go through the following stages (listed in processing order):
- Collection phase
After reading (or receiving) a log from the outside, basic processing is carried out in the collection phase. This includes log splitting (dividing a large block of text into several independent raw logs), encoding and decoding (converting to UTF-8), removing interfering color (ANSI escape) characters, and so on.
- Single log cutting
If a Pipeline is configured for the corresponding log, each log entry (including a single multi-line log) is cut by the Pipeline, which mainly involves two steps:
- Grok/JSON cutting: through Grok/JSON, a single raw log is cut into structured data.
- Fine processing of the extracted fields, such as completing IP information or desensitizing the log.
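A minimal sketch of such a Pipeline script (using grok/cast/geoip/cover functions from DataKit's Pipeline function set; the pattern and field names here are purely illustrative) might look like this:

```python
# Step 1: Grok cutting -- turn the raw text into structured fields
grok(_, "%{IPORHOST:client_ip} %{NOTSPACE:http_method} %{NUMBER:status_code}")

# Step 2: fine processing of the extracted fields
cast(status_code, "int")   # make the status code an integer
geoip(client_ip)           # complete IP information (country/province/city, ...)
cover(client_ip, [1, 7])   # desensitize part of the client IP
```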
- Blacklist (Filter)
A Filter is a set of filters: it receives a batch of structured data and decides, through certain logical judgments, whether the data should be discarded. Filters are logical rules distributed by the center (and actively pulled by DataKit); their form is roughly as follows:
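(The values in the list below are only placeholders matching the description that follows; the actual rules are whatever the center distributes.)

```
{ source = 'datakit' and ( bar in [ 'abc', 'def', '123' ] ) }
```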
If the center configures such a log blacklist, and 10 out of 100 cut logs meet the conditions above (that is, their source is datakit and the value of their bar field appears in the list), then those 10 logs are not reported to Guance Cloud and are silently discarded. The discard statistics can be seen in the DataKit Monitor.
- Reporting to Guance Cloud
After these steps, the log data is finally reported to Guance Cloud and can be seen on the log viewing page.
Under normal circumstances, if collection succeeds, there is a delay of about 30 s between the moment a log is generated and the moment the data can be seen on the page. During this period, DataKit uploads its data at intervals of at most 10 s, and the center has to go through a series of processing steps before the data is finally written to storage.