Kubernetes Logs
Datakit supports collecting Kubernetes and host container logs, which can be classified into the following two types based on the data source:
- Console output: the stdout/stderr output of the container application, which is the most common way. It can be viewed with commands like `docker logs` or `kubectl logs`.
- Container internal files: if the logs are not written to stdout/stderr, they are usually stored in files. Collecting this type of log requires mounting.

This article provides a detailed introduction to these two collection methods.
Logging Collection for Console stdout/stderr
Console output (stdout/stderr) is written to files by the container runtime, and Datakit automatically fetches the LogPath of the container for collection.
If you want to customize the collection configuration, it can be done through adding container environment variables or Kubernetes Pod Annotations.
The following are the key points for custom configuration:

- For container environment variables, the key must be `DATAKIT_LOGS_CONFIG`.
- For Pod Annotations, there are two possible key formats:
    - `datakit/$CONTAINER_NAME.logs`, where `$CONTAINER_NAME` must be replaced with the name of a container in the current Pod. This format is intended for multi-container Pods.
    - `datakit/logs`, which applies to all containers of the Pod.

Info

If a container has the environment variable `DATAKIT_LOGS_CONFIG` and its Pod also has the Annotation `datakit/logs`, the configuration from the container environment variable takes precedence.
The value for the custom configuration is as follows:

```json
[
  {
    "disable": false,
    "source": "<your-source>",
    "service": "<your-service>",
    "pipeline": "<your-pipeline.p>",
    "remove_ansi_escape_codes": false,
    "from_beginning": false,
    "tags": {
      "<some-key>": "<some_other_value>"
    }
  }
]
```
Field explanations:
| Field Name | Possible Values | Explanation |
| --- | --- | --- |
| `disable` | true/false | Whether to disable log collection for the container. Defaults to `false`. |
| `type` | `file`/empty | The type of collection. When collecting logs from files inside the container, it must be `file`. The default is empty, which means collecting stdout/stderr. |
| `path` | string | The file path to collect. When collecting logs from files inside the container, set it to the path of the volume, which is accessible from outside the container. Not required when collecting stdout/stderr. |
| `source` | string | The source of the logs. Refer to Source Setting for Container Log Collection below. |
| `service` | string | The service the logs belong to. Defaults to the log source (`source`). |
| `pipeline` | string | The Pipeline script used to process the logs. Defaults to the script whose name matches the log source (`<source>.p`). |
| `remove_ansi_escape_codes` | true/false | Whether to remove ANSI escape codes. |
| `from_beginning` | true/false | Whether to collect logs from the beginning of the file. |
| `multiline_match` | regular expression string | The pattern used to recognize the first line of a multiline log entry, e.g. `"multiline_match":"^\\d{4}"` means a first line starts with four digits. In regular expressions, `\d` matches a digit, and the preceding `\` is for escaping. |
| `character_encoding` | string | The character encoding. If the encoding is wrong, the data may be unreadable. Supported values are `utf-8`, `utf-16le`, `utf-16be`, `gbk`, `gb18030`, or an empty string. Defaults to empty. |
| `tags` | key/value pairs | Additional tags. If a key already exists, the value configured here takes precedence (Version-1.4.6). |
Below is a complete example:

```shell
$ cat Dockerfile
FROM pubrepo.guance.com/base/ubuntu:18.04 AS base
RUN mkdir -p /opt
RUN echo 'i=0; \n\
while true; \n\
do \n\
  echo "$(date +"%Y-%m-%d %H:%M:%S") [$i] Bash For Loop Examples. Hello, world! Testing output."; \n\
  i=$((i+1)); \n\
  sleep 1; \n\
done \n'\
>> /opt/s.sh
CMD ["/bin/bash", "/opt/s.sh"]

## Build the image
$ docker build -t testing/log-output:v1 .

## Start the container, adding the environment variable DATAKIT_LOGS_CONFIG (note the character escaping)
$ docker run --name log-output --env DATAKIT_LOGS_CONFIG='[{"disable":false,"source":"log-source","service":"log-service"}]' -d testing/log-output:v1
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-demo-deployment
  labels:
    app: log-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: log-demo
  template:
    metadata:
      labels:
        app: log-demo
      annotations:
        ## Add the configuration and specify the container as log-output
        datakit/log-output.logs: |
          [{
            "disable": false,
            "source": "log-output-source",
            "service": "log-output-service",
            "tags": {
              "some_tag": "some_value"
            }
          }]
    spec:
      containers:
      - name: log-output
        image: pubrepo.guance.com/base/ubuntu:18.04
        args:
        - /bin/sh
        - -c
        - >
          i=0;
          while true;
          do
            echo "$(date +'%F %H:%M:%S') [$i] Bash For Loop Examples. Hello, world! Testing output.";
            i=$((i+1));
            sleep 1;
          done
```
Attention

- Unless necessary, avoid configuring the Pipeline in environment variables and Pod Annotations; in general, it can be inferred automatically from the `source` field.
- When adding Env/Annotations in configuration files or on the command line, the value must be wrapped in double quotes with escape characters. The value of `multiline_match` requires double escaping, with 4 backslashes representing a single one: for example, `\"multiline_match\":\"^\\\\d{4}\"` is equivalent to `"multiline_match":"^\d{4}"`. Here is an example:

```shell
kubectl annotate pods my-pod datakit/logs="[{\"disable\":false,\"source\":\"log-source\",\"service\":\"log-service\",\"pipeline\":\"test.p\",\"only_images\":[\"image:<your_image_regexp>\"],\"multiline_match\":\"^\\\\d{4}-\\\\d{2}\"}]"
```

- In some environments the `kubectl annotate` command does not take effect.
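The two layers of unescaping (the shell removes one, JSON parsing removes the other) can be checked with a short Python snippet. Python is used here only for illustration; Datakit itself does not run this:

```python
import json
import re

# The Annotation value as stored on the Pod, after the shell has removed
# one layer of escaping: each \\\\ typed on the command line became \\ here.
raw = r'[{"disable": false, "source": "log-source", "multiline_match": "^\\d{4}-\\d{2}"}]'

config = json.loads(raw)                  # JSON parsing removes the second layer
pattern = config[0]["multiline_match"]    # now the plain regex ^\d{4}-\d{2}

# Lines starting with "YYYY-MM" are treated as the first line of a log entry
print(bool(re.match(pattern, "2024-01-02 15:04:05 first line")))  # True
print(bool(re.match(pattern, "    at java.lang.Thread.run")))     # False
```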
Logging for Log Files Inside Containers
For log files inside containers, the configuration is similar to that for console output, except that the file path must be specified; the other fields are mostly the same. Likewise, the configuration can be added either as a container environment variable or as a Kubernetes Pod Annotation; the keys and values are the same as described in the previous section.
Here is a complete example:

```shell
$ cat Dockerfile
FROM pubrepo.guance.com/base/ubuntu:18.04 AS base
RUN mkdir -p /opt
RUN echo 'i=0; \n\
while true; \n\
do \n\
  echo "$(date +"%Y-%m-%d %H:%M:%S") [$i] Bash For Loop Examples. Hello, world! Testing output." >> /tmp/opt/log; \n\
  i=$((i+1)); \n\
  sleep 1; \n\
done \n'\
>> /opt/s.sh
CMD ["/bin/bash", "/opt/s.sh"]

## Build the image
$ docker build -t testing/log-to-file:v1 .

## Start the container, adding the environment variable DATAKIT_LOGS_CONFIG (note the character escaping).
## Unlike the stdout configuration, "type" and "path" are mandatory fields, and a volume for the path must be added:
## for the path `/tmp/opt/log`, add `/tmp/opt` as an anonymous volume.
$ docker run --env DATAKIT_LOGS_CONFIG="[{\"disable\":false,\"type\":\"file\",\"path\":\"/tmp/opt/log\",\"source\":\"log-source\",\"service\":\"log-service\"}]" -v /tmp/opt -d testing/log-to-file:v1
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: log-demo-deployment
  labels:
    app: log-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: log-demo
  template:
    metadata:
      labels:
        app: log-demo
      annotations:
        ## Add the configuration and specify the container as logging-demo.
        ## Configure both file and stdout collection; the emptyDir volume for "/tmp/opt" must be added first.
        datakit/logging-demo.logs: |
          [
            {
              "disable": false,
              "type": "file",
              "path": "/tmp/opt/log",
              "source": "logging-file",
              "tags": {
                "some_tag": "some_value"
              }
            },
            {
              "disable": false,
              "source": "logging-output"
            }
          ]
    spec:
      containers:
      - name: logging-demo
        image: pubrepo.guance.com/base/ubuntu:18.04
        args:
        - /bin/sh
        - -c
        - >
          i=0;
          while true;
          do
            echo "$(date +'%F %H:%M:%S') [$i] Bash For Loop Examples. Hello, world! Testing output.";
            echo "$(date +'%F %H:%M:%S') [$i] Bash For Loop Examples. Hello, world! Testing output." >> /tmp/opt/log;
            i=$((i+1));
            sleep 1;
          done
        volumeMounts:
        - mountPath: /tmp/opt
          name: datakit-vol-opt
      volumes:
      - name: datakit-vol-opt
        emptyDir: {}
```
For log files inside containers, in a Kubernetes environment, collection can also be achieved by adding a sidecar; refer to the sidecar log collection documentation for more information.
Adjust Log Collection According to Container Image

By default, Datakit collects stdout/stderr logs for all containers on the machine/Node, which may not be what you want. Sometimes you want to collect (or exclude) only the logs of certain containers; the target containers/Pods can be referred to indirectly by image name.
```toml
## Take image for example.
## Collect a container's logs when its image matches `datakit`.
container_include_log = ["image:datakit"]
## Ignore all kodo containers.
container_exclude_log = ["image:kodo"]
```

`container_include` and `container_exclude` entries must start with an attribute field followed by a Glob pattern: `"<field name>:<glob rule>"`.
The following 4 field rules are supported, all of which are infrastructure attribute fields:

- image: e.g. `image:pubrepo.guance.com/datakit/datakit:1.18.0`
- image_name: e.g. `image_name:pubrepo.guance.com/datakit/datakit`
- image_short_name: e.g. `image_short_name:datakit`
- namespace: e.g. `namespace:datakit-ns`
For the same rule type (`image` or `namespace`), if both `include` and `exclude` are configured, a container is collected only when it matches `include` and does not match `exclude`. For example:

```toml
## This filters out all containers: a container `datakit` that matches both `include` and `exclude`
## is excluded from log collection, and a container `nginx` that does not match `include` in the
## first place is excluded as well.
container_include_log = ["image_name:datakit"]
container_exclude_log = ["image_name:*"]
```
When rules of multiple types are configured, matching any one of them stops log collection for the container. Example:

```toml
## The container only needs to match either `image_name` or `namespace` for its log collection to stop.
container_include_log = []
container_exclude_log = ["image_name:datakit", "namespace:datakit-ns"]
```
The configuration rules for `container_include_log` and `container_exclude_log` are complex, and using them together can produce a variety of precedence cases. It is recommended to use only `container_exclude_log`.
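The way `include` and `exclude` rules compose can be sketched in Python, using `fnmatch` as a stand-in for Datakit's Glob matching. This is a simplified illustration, not Datakit's actual code:

```python
from fnmatch import fnmatch

def rule_matches(rules, fields):
    """True if any "<field>:<glob>" rule matches the container's attribute fields."""
    for rule in rules:
        field, _, glob = rule.partition(":")
        if fnmatch(fields.get(field, ""), glob):
            return True
    return False

def should_collect(fields, include, exclude):
    # A container is collected only when it matches include (or include is
    # empty) AND does not match exclude.
    if include and not rule_matches(include, fields):
        return False
    return not rule_matches(exclude, fields)

# An exclude of "image_name:*" filters everything, even containers in include:
print(should_collect({"image_name": "datakit"},
                     ["image_name:datakit"], ["image_name:*"]))  # False
# Matching either rule type in exclude stops collection:
print(should_collect({"image_name": "nginx", "namespace": "datakit-ns"},
                     [], ["image_name:datakit", "namespace:datakit-ns"]))  # False
```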
The following environment variables:

- ENV_INPUT_CONTAINER_CONTAINER_INCLUDE_LOG
- ENV_INPUT_CONTAINER_CONTAINER_EXCLUDE_LOG

can be used to configure log collection for containers. Suppose there are three Pods whose images are:

- A: `hello/hello-http:latest`
- B: `world/world-http:latest`
- C: `pubrepo.guance.com/datakit/datakit:1.2.0`

If you want to collect only the logs of Pod A, configure ENV_INPUT_CONTAINER_CONTAINER_INCLUDE_LOG:

```yaml
- env:
  - name: ENV_INPUT_CONTAINER_CONTAINER_INCLUDE_LOG
    value: image:hello*  # Specify the image name or its wildcard
```
Alternatively, filter by namespace, e.g. `value: namespace:datakit-ns`.

How to view a container's image:

- Docker: `docker inspect --format "{{.Config.Image}}" <container_id>`
- Kubernetes Pod: `kubectl describe pod <pod_name>` (see the Image field)
Attention

The priority of the global configuration `container_exclude_log` is lower than that of the custom `disable` configuration on the container. For example, if `container_exclude_log = ["image:*"]` is configured to exclude all logs, but a Pod has the following Annotation:
```json
[
  {
    "disable": false,
    "type": "file",
    "path": "/tmp/opt/log",
    "source": "logging-file",
    "tags": {
      "some_tag": "some_value"
    }
  },
  {
    "disable": true,
    "source": "logging-output"
  }
]
```
This configuration is closer to the container and therefore has higher priority. Its `disable=false` indicates that the log file should be collected, overriding the global configuration. Therefore, the log file of this container will still be collected, while the stdout/stderr console output will not, because of `disable=true`.
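The precedence described above can be summarized as a small sketch (hypothetical helper names, not Datakit's actual code): a per-container `disable` value, when present, always wins over the global exclude decision:

```python
def collection_enabled(global_excluded, per_container_disable=None):
    """per_container_disable is the "disable" value from the Annotation/Env,
    or None when no per-container configuration exists."""
    if per_container_disable is not None:
        # The per-container configuration is closer to the container: it wins.
        return not per_container_disable
    # Otherwise fall back to the global container_exclude_log decision.
    return not global_excluded

# container_exclude_log = ["image:*"] excludes everything globally, but the
# Annotation's "disable": false re-enables collection for this container:
print(collection_enabled(global_excluded=True, per_container_disable=False))  # True
# ...while "disable": true keeps the stdout/stderr collection off:
print(collection_enabled(global_excluded=True, per_container_disable=True))   # False
```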
FAQ

Issue with Soft Links in Log Directories
Normally, Datakit retrieves the path of log files from the container/Kubernetes API and collects the file accordingly.
However, in some special environments, a soft link may be created for the directory containing the log file, and Datakit is unable to know the target of the soft link in advance, which prevents it from mounting the directory and collecting the log file.
For example, suppose a container log file is located at `/var/log/pods/default_log-demo_f2617302-9d3a-48b5-b4e0-b0d59f1f0cd9/log-output/0.log`. In the current environment, `/var/log/pods` is a soft link pointing to `/mnt/container_logs`, as shown below:

```shell
root@node-01:~# ls /var/log -lh
total 284K
lrwxrwxrwx 1 root root 20 Oct 8 10:06 pods -> /mnt/container_logs/
```
To enable Datakit to collect this log file, the `/mnt/container_logs` hostPath needs to be mounted. For example, the following can be added to `datakit.yaml`:

```yaml
# .. omitted ..
spec:
  containers:
  - name: datakit
    image: pubrepo.guance.com/datakit/datakit:1.16.0
    volumeMounts:
    - mountPath: /mnt/container_logs
      name: container-logs
  # .. omitted ..
  volumes:
  - hostPath:
      path: /mnt/container_logs
    name: container-logs
```
This situation is not very common, and this step is usually only needed when it is known in advance that the path contains a soft link, or when Datakit logs show collection errors.
Source Setting for Container Log Collection

In a container environment, the log `source` setting is a very important configuration item, as it directly affects how logs are displayed on the page. However, configuring a source for every container's logs one by one would be tedious. When the container log source is not configured manually, DataKit infers it automatically using the following rules (in descending priority):

Attention

"Not manually specifying the container log source" means the source is specified neither in the Pod Annotation nor in container.conf (currently container.conf has no configuration item for the container log source).
- The container's own name: the name visible via `docker ps` or `crictl ps`.
- The container name specified by Kubernetes: obtained from the container's `io.kubernetes.container.name` label.
- `default`: the default `source`.
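The inference order above can be sketched as a simple fallback chain (illustrative only, not Datakit's implementation):

```python
def infer_source(container_name=None, k8s_container_name=None):
    """Pick the container log source in descending priority, per the rules above."""
    for candidate in (container_name, k8s_container_name):
        if candidate:
            return candidate
    return "default"

print(infer_source(container_name="nginx-demo"))      # nginx-demo
print(infer_source(k8s_container_name="log-output"))  # log-output
print(infer_source())                                 # default
```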
Retain Specific Fields Based on Whitelist

Container log collection includes the following basic fields:

| Field Name |
| --- |
| `service` |
| `status` |
| `filepath` |
| `log_read_lines` |
| `container_id` |
| `container_name` |
| `namespace` |
| `pod_name` |
| `pod_ip` |
| `deployment`/`daemonset`/`statefulset` |
| `inside_filepath` |
In specific scenarios, many of the basic fields are unnecessary, so a whitelist feature is provided to retain only the specified fields.

The field whitelist is configured like `ENV_INPUT_CONTAINER_LOGGING_FIELD_WHITE_LIST = '["service", "filepath", "container_name"]'`. The details are as follows:

- If the whitelist is empty, all basic fields are included.
- If the whitelist is not empty and its values are valid, e.g. `["filepath", "container_name"]`, only those fields are retained.
- If the whitelist is not empty and all of its fields are invalid, e.g. `["no-exist"]` or `["no-exist-key1", "no-exist-key2"]`, the data is discarded.
For tags from other sources, the following rules apply:

- The whitelist has no effect on Datakit's global tags.
- Debug fields enabled via `ENV_ENABLE_DEBUG_FIELDS = "true"` are not affected, including the `log_read_offset` and `log_file_inode` fields of log collection, as well as the debug fields of the Pipeline.
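The three whitelist cases can be sketched as follows (a hypothetical helper for illustration, not Datakit's implementation):

```python
def apply_whitelist(fields, whitelist):
    """Return the fields kept for one log record, or None if it is discarded."""
    if not whitelist:
        return dict(fields)               # empty whitelist: keep all basic fields
    kept = {k: v for k, v in fields.items() if k in whitelist}
    return kept or None                   # no valid field left: discard the data

fields = {"service": "s", "filepath": "/tmp/1.log", "container_name": "c", "pod_name": "p"}
print(apply_whitelist(fields, []))                               # all four fields
print(apply_whitelist(fields, ["filepath", "container_name"]))   # only those two
print(apply_whitelist(fields, ["no-exist"]))                     # None (discarded)
```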
Wildcard Collection of Log Files in Containers

To collect log files inside a container, you need to add a configuration in Annotations/Labels and specify the `path`, as follows:

```json
[
  {
    "disable": false,
    "type": "file",
    "path": "/tmp/opt/log",
    "source": "logging-file",
    "tags": {
      "some_tag": "some_value"
    }
  }
]
```
The `path` configuration supports glob rules for specifying files in batch. For example, to collect `/tmp/opt/mysql/1.log` and `/tmp/opt/mysql/errors/2.log`, you can write it like this:

```json
[
  {
    "disable": false,
    "type": "file",
    "path": "/tmp/opt/**/*.log",
    "source": "logging-file",
    "tags": {
      "some_tag": "some_value"
    }
  }
]
```
The `path` configuration uses doublestar (`**`) to match multiple directory levels, and `*.log` matches all files ending with `.log`. This way, log files with different directories and names are all collected.

Note that the mount directory of the emptyDir volume must be at or above the directory being matched. To collect `/tmp/opt/**/*.log`, you must mount `/tmp/opt` or a higher-level directory such as `/tmp`; otherwise, the files will not be found.
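To see what `**/*.log` matches, here is a quick check using Python's `pathlib` globbing as a stand-in for doublestar (the semantics are close enough for this example):

```python
import pathlib
import tempfile

# Build a throwaway tree mirroring the example above
root = pathlib.Path(tempfile.mkdtemp())
(root / "mysql" / "errors").mkdir(parents=True)
(root / "mysql" / "1.log").write_text("a\n")
(root / "mysql" / "errors" / "2.log").write_text("b\n")
(root / "mysql" / "errors" / "notes.txt").write_text("c\n")

# "**" descends into any number of directory levels; "*.log" filters by name
matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.log"))
print(matched)  # ['mysql/1.log', 'mysql/errors/2.log']
```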
Extended Reading

- Pipeline: Text Data Processing
- Overview of DataKit Log Collection