Datakit Operator User Guide



Bug fix:

  • Fixed the issue where using wildcard paths in logfiles when injecting logfwd would cause a mount error.
  • Fixed a critical issue where injecting logfwd into a Pod with 2 or more containers would fail and affect the original Pod startup.

Overview and Installation

Datakit Operator integrates Datakit with Kubernetes orchestration. It assists with deploying Datakit and provides supporting functions such as validation and injection.

Currently, Datakit Operator provides the following functions:

  • Inject the DDTrace SDK (Java/Python/Node.js) and related environment variables. See documentation.
  • Inject the logfwd sidecar to collect Pod logs. See documentation.
  • Support task distribution for Datakit plugins. See documentation.


Prerequisites

  • Kubernetes v1.24.1 or later is recommended, with internet access (to download the YAML file and pull images).
  • Ensure MutatingAdmissionWebhook and ValidatingAdmissionWebhook controllers are enabled.
  • Ensure API is enabled.

Installation Steps

Download datakit-operator.yaml, and follow these steps:

$ kubectl create namespace datakit
$ wget
$ kubectl apply -f datakit-operator.yaml
$ kubectl get pod -n datakit

NAME                               READY   STATUS    RESTARTS   AGE
datakit-operator-f948897fb-5w5nm   1/1     Running   0          15s

  • The Datakit-Operator program and its YAML file are strictly version-matched. An outdated YAML file may fail to install a newer version of Datakit-Operator, so always download the latest YAML file.
  • If you encounter an InvalidImageName error, you can pull the image manually.

Relevant Configuration


The configuration for the Datakit Operator is in JSON format and is stored as a separate ConfigMap in Kubernetes, which is loaded into the container as an environment variable.

The default configuration is as follows:

{
    "server_listen": "",
    "log_level":     "info",
    "admission_inject": {
        "ddtrace": {
            "images": {
                "java_agent_image":   "",
                "python_agent_image": "",
                "js_agent_image":     ""
            },
            "envs": {
                "DD_AGENT_HOST":           "datakit-service.datakit.svc",
                "DD_TRACE_AGENT_PORT":     "9529",
                "DD_JMXFETCH_STATSD_HOST": "datakit-service.datakit.svc",
                "DD_JMXFETCH_STATSD_PORT": "8125",
                "POD_NAME":                "{}",
                "POD_NAMESPACE":           "{fieldRef:metadata.namespace}",
                "NODE_NAME":               "{fieldRef:spec.nodeName}",
                "DD_TAGS":                 "pod_name:$(POD_NAME),pod_namespace:$(POD_NAMESPACE),host:$(NODE_NAME)"
            }
        },
        "logfwd": {
            "images": {
                "logfwd_image": ""
            }
        },
        "profiler": {
            "images": {
                "java_profiler_image":   "",
                "python_profiler_image": "",
                "golang_profiler_image": ""
            },
            "envs": {
                "DK_AGENT_HOST":  "datakit-service.datakit.svc",
                "DK_AGENT_PORT":  "9529",
                "DK_PROFILE_VERSION": "1.2.333",
                "DK_PROFILE_ENV": "prod",
                "DK_PROFILE_DURATION": "240",
                "DK_PROFILE_SCHEDULE": "0 * * * *"
            }
        }
    }
}

Within admission_inject, ddtrace and logfwd can be configured in finer detail:

  • images is a collection of Key/Value pairs with fixed keys, where modifying the Value allows for customization of image paths.

The Datakit Operator's ddtrace agent images are stored in a central image repository. In environments that cannot access this repository, you can modify the configuration to specify a different image path, as follows:

  1. In an environment that can access the repository, pull the image and push it to your own image repository, for example inside.image.hub/datakit-operator/dd-lib-java-init:v1.8.4-guance.
  2. Modify the JSON configuration by changing admission_inject->ddtrace->images->java_agent_image to inside.image.hub/datakit-operator/dd-lib-java-init:v1.8.4-guance, and apply this YAML.
  3. Thereafter, the Datakit Operator will use the new Java Agent image path.

The Datakit Operator does not validate images. If the image path is incorrect, Kubernetes reports an error when it tries to start the container.

If a version is specified in the admission.datakit/java-lib.version annotation, for example admission.datakit/java-lib.version:v2.0.1-guance or admission.datakit/java-lib.version:latest, the annotated version is used (v2.0.1-guance in the first example).

  • envs is also a collection of Key/Value pairs, but with variable keys and values. The Datakit Operator injects all Key/Value environment variables into the target container. For example, add FAKE_ENV to envs:

{
    "admission_inject": {
        "ddtrace": {
            "images": {
                "java_agent_image":   "",
                "python_agent_image": "",
                "js_agent_image":     ""
            },
            "envs": {
                "DD_AGENT_HOST":           "datakit-service.datakit.svc",
                "DD_TRACE_AGENT_PORT":     "9529",
                "FAKE_ENV":                "ok"
            }
        }
    }
}

Every container that has the ddtrace agent injected will have all of the variables listed in envs, including FAKE_ENV, added to its environment.
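The two customizations described above (overriding an image path and adding an environment variable) can be sketched in Python as an edit on a parsed copy of the configuration. This is only an illustration: the helper name customize_ddtrace and the abbreviated DEFAULT_CONFIG literal are assumptions made for this example, and writing the result back to the ConfigMap is left out.

```python
import copy
import json

# Abbreviated copy of the default configuration shown above (illustrative only).
DEFAULT_CONFIG = {
    "admission_inject": {
        "ddtrace": {
            "images": {
                "java_agent_image": "",
                "python_agent_image": "",
                "js_agent_image": "",
            },
            "envs": {
                "DD_AGENT_HOST": "datakit-service.datakit.svc",
                "DD_TRACE_AGENT_PORT": "9529",
            },
        }
    }
}


def customize_ddtrace(config, java_image=None, extra_envs=None):
    """Return a copy of config with an overridden Java image and extra envs.

    Hypothetical helper for illustration; it is not part of Datakit Operator.
    """
    updated = copy.deepcopy(config)
    ddtrace = updated["admission_inject"]["ddtrace"]
    if java_image:
        # The keys under "images" are fixed; only the values may change.
        ddtrace["images"]["java_agent_image"] = java_image
    # Keys under "envs" are free-form; every pair is injected as a variable.
    ddtrace["envs"].update(extra_envs or {})
    return updated


new_config = customize_ddtrace(
    DEFAULT_CONFIG,
    java_image="inside.image.hub/datakit-operator/dd-lib-java-init:v1.8.4-guance",
    extra_envs={"FAKE_ENV": "ok"},
)
print(json.dumps(new_config, indent=4))
```

Working on a deep copy keeps the default configuration untouched, which makes it easy to compare the two versions before applying the change.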

In Datakit Operator v1.4.2 and later, envs supports fetching values from Kubernetes Downward API fields, written as {fieldRef:<field>}. The following fields are supported:

  • metadata.name: The pod's name.
  • metadata.namespace: The pod's namespace.
  • metadata.uid: The pod's unique ID.
  • metadata.annotations['<KEY>']: The value of the pod's annotation named <KEY> (for example, metadata.annotations['myannotation']).
  • metadata.labels['<KEY>']: The text value of the pod's label named <KEY> (for example, metadata.labels['mylabel']).
  • spec.serviceAccountName: The name of the pod's service account.
  • spec.nodeName: The name of the node where the Pod is executing.
  • status.hostIP: The primary IP address of the node to which the Pod is assigned.
  • status.hostIPs: The dual-stack version of status.hostIP. The first entry is always the same as status.hostIP. This field is available if you enable the PodHostIPs feature gate.
  • status.podIP: The pod's primary IP address (usually, its IPv4 address).
  • status.podIPs: The dual-stack version of status.podIP. The first entry is always the same as status.podIP.

If a value is not written in a recognized form, it is treated as a plain string and added to the environment variable as-is. For example, "POD_NAME":"{fieldRef:metadata.PODNAME}" is not a valid field reference, so the environment variable ends up as POD_NAME={fieldRef:metadata.PODNAME}.
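The rule above (recognized field references are resolved through the Downward API, everything else stays a plain string) can be sketched as follows. This is an illustration in Python rather than the Operator's actual implementation; the function name to_env_var is hypothetical, and the output uses the standard Kubernetes EnvVar shape (value, or valueFrom.fieldRef.fieldPath).

```python
import re

# Field paths listed in this section as supported by the fieldRef syntax.
# metadata.name is the Kubernetes Downward API field for the pod's name.
SUPPORTED_FIELDS = {
    "metadata.name",
    "metadata.namespace",
    "metadata.uid",
    "spec.serviceAccountName",
    "spec.nodeName",
    "status.hostIP",
    "status.hostIPs",
    "status.podIP",
    "status.podIPs",
}
# metadata.annotations['<KEY>'] and metadata.labels['<KEY>'] take a quoted key.
KEYED_FIELD = re.compile(r"^metadata\.(annotations|labels)\['[^']+'\]$")
FIELD_REF = re.compile(r"^\{fieldRef:(.+)\}$")


def to_env_var(name, value):
    """Build a Kubernetes EnvVar dict from one envs entry (illustrative)."""
    match = FIELD_REF.match(value)
    if match:
        field_path = match.group(1)
        if field_path in SUPPORTED_FIELDS or KEYED_FIELD.match(field_path):
            return {
                "name": name,
                "valueFrom": {"fieldRef": {"fieldPath": field_path}},
            }
    # Anything that is not a recognized field reference stays a plain string.
    return {"name": name, "value": value}


print(to_env_var("POD_NAMESPACE", "{fieldRef:metadata.namespace}"))
print(to_env_var("POD_NAME", "{fieldRef:metadata.PODNAME}"))
```

The second call shows the fallback: the misspelled field is not recognized, so the literal text is kept as the variable's value.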

Using Datakit-Operator to Inject Files and Programs

In large Kubernetes clusters, it can be quite difficult to make bulk configuration changes. Datakit-Operator will determine whether or not to modify or inject data based on Annotation configuration.

The following functions are currently supported:

  • Injection of ddtrace agent and environment
  • Mounting of logfwd sidecar and enabling log collection

Only version v1 of the Deployment, DaemonSet, CronJob, Job, and StatefulSet kinds is supported. Because Datakit-Operator operates on the PodTemplate, standalone Pods are not supported. This article uses Deployment to stand for all five kinds.

Injection of ddtrace Agent and Relevant Environment Variables


  1. On the target Kubernetes cluster, download and install Datakit-Operator.
  2. Add the annotation described below to the Deployment to indicate that ddtrace files should be injected. Note that the annotation must be added to the Pod template (spec.template.metadata.annotations).
    • The key is admission.datakit/%s-lib.version, where %s is replaced with the target language. java, python, and js are currently supported.
    • The value is the version number. If left blank, the default image version of the environment variable will be used.
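As a minimal sketch of the key format and the annotation's placement, the Python snippet below builds a JSON merge patch that puts the annotation on the Pod template. The helper name build_annotation_patch is hypothetical. The CronJob branch reflects the standard Kubernetes layout, where the Pod template sits under spec.jobTemplate.spec.template.

```python
import json

# Languages listed above as currently supported.
SUPPORTED_LANGUAGES = ("java", "python", "js")


def build_annotation_patch(language, version="", kind="Deployment"):
    """Build a merge patch that adds the ddtrace annotation to the Pod template.

    Hypothetical helper for illustration; it is not part of Datakit Operator.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    key = f"admission.datakit/{language}-lib.version"
    template = {"metadata": {"annotations": {key: version}}}
    if kind == "CronJob":
        # A CronJob nests its Pod template one level deeper.
        return {"spec": {"jobTemplate": {"spec": {"template": template}}}}
    return {"spec": {"template": template}}


patch = build_annotation_patch("js")
print(json.dumps(patch))
```

The printed JSON has the shape expected by kubectl patch with --type merge. Whether the Operator acts on a change to an existing workload is not covered in this guide, so treat that use as an assumption and prefer setting the annotation in the manifest before applying it.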


The following is an example Deployment that injects dd-js-lib into all Pods it creates:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        admission.datakit/js-lib.version: ""
    spec:
      containers:
      - name: nginx
        image: nginx:1.22
        ports:
        - containerPort: 80

Create the resource from the YAML file:

$ kubectl apply -f nginx.yaml

Verify as follows:

$ kubectl get pod
NAME                                   READY   STATUS    RESTARTS      AGE
nginx-deployment-7bd8dd85f-fzmt2       1/1     Running   0             4s

$ kubectl get pod nginx-deployment-7bd8dd85f-fzmt2 -o=jsonpath={.spec.initContainers\[\*\].name}

Injecting Logfwd Program and Enabling Log Collection


logfwd is a proprietary log collection application for Datakit. To use it, you need to first deploy Datakit in the same Kubernetes cluster and satisfy the following two conditions:

  1. The Datakit logfwdserver collector is enabled, for example, listening on port 9533.
  2. The Datakit service needs to open port 9533 to allow other Pods to access datakit-service.datakit.svc:9533.


  1. On the target Kubernetes cluster, download and install Datakit-Operator.
  2. In the deployment, add the specified Annotation to indicate that a logfwd sidecar needs to be mounted. Note that the Annotation should be added in the template.
    • The key is uniformly admission.datakit/logfwd.instances.
    • The value is a JSON string containing the logfwd configuration, as shown below:

[
    {
        "datakit_addr": "datakit-service.datakit.svc:9533",
        "loggings": [
            {
                "logfiles": ["<your-logfile-path>"],
                "ignore": [],
                "source": "<your-source>",
                "service": "<your-service>",
                "pipeline": "<your-pipeline.p>",
                "character_encoding": "",
                "multiline_match": "<your-match>",
                "tags": {}
            },
            {
                "logfiles": ["<your-logfile-path-2>"],
                "source": "<your-source-2>"
            }
        ]
    }
]

The parameters are explained below. See also the logfwd configuration:

  • datakit_addr is the Datakit logfwdserver address.
  • loggings is the main configuration. It is an array whose entries follow the Datakit logging collector configuration.
    • logfiles is a list of log files. Paths can be absolute, and glob rules can be used to match multiple files. Absolute paths are recommended.
    • ignore filters file paths using glob rules. A file that matches any filter is not collected.
    • source is the data source. If it is empty, 'default' is used.
    • service adds a service tag. If it is empty, $source is used.
    • pipeline is the Pipeline script path. If it is empty, $source.p is used. If $source.p does not exist, no Pipeline is applied. (This script file exists on the DataKit side.)
    • character_encoding selects the character encoding. If the wrong encoding is chosen, the data will be unreadable. Leaving it blank is recommended. Supported encodings include utf-8, utf-16le, utf-16be, gbk, gb18030, or "".
    • multiline_match configures multiline matching, as described in Datakit Log Multiline Configuration. Because the value is embedded in JSON, the unescaped triple-single-quote syntax is not supported and backslashes must be escaped: the regex ^\d{4} is written as ^\\d{4}.
    • tags adds additional tags in JSON map format, such as { "key1":"value1", "key2":"value2" }.
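Writing the annotation value by hand is error-prone, mainly because of the backslash escaping that multiline_match needs. As a hedged sketch, the Python snippet below builds the value with json.dumps, which applies the escaping automatically. The helper name build_logfwd_annotation is hypothetical, and the field values reuse examples from this guide.

```python
import json


def build_logfwd_annotation(datakit_addr, loggings):
    """Serialize a logfwd configuration into the annotation's JSON string.

    Hypothetical helper for illustration; it is not part of Datakit Operator.
    """
    instances = [{"datakit_addr": datakit_addr, "loggings": loggings}]
    # Compact separators keep the annotation value on one short line.
    return json.dumps(instances, separators=(",", ":"))


annotation_value = build_logfwd_annotation(
    "datakit-service.datakit.svc:9533",
    [
        {
            "logfiles": ["/var/log/log-test/*.log"],
            "source": "deployment-logging",
            # The raw string holds the regex as-is; json.dumps adds the escape.
            "multiline_match": r"^\d{4}",
            "tags": {"key01": "value01"},
        }
    ],
)
print(annotation_value)
```

In the printed string the regex appears as ^\\d{4}, which is the escaped form described above. The result can be pasted, wrapped in single quotes, as the value of admission.datakit/logfwd.instances.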


Here is an example Deployment that uses a shell loop to write data to a file continuously and configures collection of that file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: logging-deployment
  labels:
    app: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: logging
  template:
    metadata:
      labels:
        app: logging
      annotations:
        admission.datakit/logfwd.instances: '[{"datakit_addr":"datakit-service.datakit.svc:9533","loggings":[{"logfiles":["/var/log/log-test/*.log"],"source":"deployment-logging","tags":{"key01":"value01"}}]}]'
    spec:
      containers:
      - name: log-container
        image: busybox
        args: [/bin/sh, -c, 'mkdir -p /var/log/log-test; i=0; while true; do printf "$(date "+%F %H:%M:%S") [%-8d] Bash For Loop Examples.\\n" $i >> /var/log/log-test/1.log; i=$((i+1)); sleep 1; done']

Create the resource from the YAML file:

$ kubectl apply -f logging.yaml

Verify as follows:

$ kubectl get pod
NAME                                   READY   STATUS    RESTARTS      AGE
logging-deployment-5d48bf9995-vt6bb       1/1     Running   0             4s

$ kubectl get pod logging-deployment-5d48bf9995-vt6bb -o=jsonpath={.spec.containers\[\*\].name}
log-container datakit-logfwd
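The same check can be scripted. The Python sketch below inspects a Pod object, such as the JSON printed by kubectl get pod with -o json, and reports whether a container named datakit-logfwd is present. The function name has_logfwd_sidecar and the trimmed sample Pod are assumptions made for this example.

```python
import json


def has_logfwd_sidecar(pod, sidecar_name="datakit-logfwd"):
    """Return True if the Pod spec contains a container named sidecar_name."""
    containers = pod.get("spec", {}).get("containers", [])
    return any(c.get("name") == sidecar_name for c in containers)


# Trimmed stand-in for the output of `kubectl get pod <name> -o json`.
sample_pod = json.loads("""
{
  "spec": {
    "containers": [
      {"name": "log-container"},
      {"name": "datakit-logfwd"}
    ]
  }
}
""")
print(has_logfwd_sidecar(sample_pod))
```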

Finally, you can check whether the logs have been collected on the Observability Cloud Log Platform.



