Sidecar for Pod Logging
""
To collect the logs of application containers in a Kubernetes Pod, a lightweight log collection client is provided. It is mounted into the Pod as a sidecar and forwards the collected logs to DataKit.
Use¶
Setup has two parts: first configure DataKit to enable the corresponding log-receiving input, then configure and start logfwd collection.
DataKit Configuration¶
You need to enable the logfwdserver input. Go to the conf.d/samples directory under the DataKit installation directory, copy logfwdserver.conf.sample and rename it logfwdserver.conf. Example:
```toml
[inputs.logfwdserver]  # note that this is the logfwdserver configuration
  ## address and port on which the logfwd receiver listens
  address = "0.0.0.0:9533"

  [inputs.logfwdserver.tags]
  # some_tag = "some_value"
  # more_tag = "some_other_value"
```
Once configured, restart DataKit. Alternatively, the input can be enabled by injecting the logfwdserver configuration via a ConfigMap.
logfwd Usage and Configuration (1.86.0 and later)¶
logfwd is recommended for use in Kubernetes Serverless environments. If DaemonSet DataKit is already deployed, using logfwd may result in duplicate data.
Since logfwd version 1.86.0, the overall usage has been further simplified, and some cumbersome configurations have been removed. The main new capabilities are as follows:
- Support pulling the `ClusterLoggingConfig` CRD through DataKit-Operator, automatically matching Pods and hot-loading collection configurations;
- Compatible with manual environment-variable configuration (`LOGFWD_LOG_CONFIGS`) for scenarios without DataKit-Operator, or for debugging;
- Collection tasks communicate with the DataKit `inputs.logfwdserver` input over WebSocket, with automatic reconnection on connection failure (retry every second);
- Automatically supplement Pod metadata (`pod_name`, `namespace`, `pod_ip`) and target Labels, staying seamlessly compatible with the old volume/mount solution.
Startup Method Overview¶
| Scenario | Key Variables | Description |
|---|---|---|
| With DataKit-Operator (recommended) | `LOGFWD_DATAKIT_OPERATOR_ENDPOINT` + Pod metadata | DataKit-Operator returns the matching CRD JSON and logfwd automatically creates/refreshes tailers; log paths, pipelines, etc. are declared in ClusterLoggingConfig. |
| Manual configuration | `LOGFWD_LOG_CONFIGS` | Same JSON semantics as the old version, but passed through an environment variable; suitable for development/transition scenarios and can coexist with DataKit-Operator (manual configuration has higher priority). |
As in the old version, you still need to prepare a shared volume/volumeMount for the log files; logfwd only watches files and does not create mounts.
Global Environment Variables¶
| Environment Variable Name | Configuration Item Meaning |
|---|---|
| `LOGFWD_LOG_LEVEL` | Runtime log level; default `info`, set to `debug` for more debug output. |
| `LOGFWD_DATAKIT_HOST` | DataKit instance address (IP or resolvable domain name). |
| `LOGFWD_DATAKIT_PORT` | DataKit logfwdserver listening port, e.g. `9533`. |
| `LOGFWD_DATAKIT_OPERATOR_ENDPOINT` | DataKit-Operator endpoint, e.g. `datakit-operator.datakit.svc:443` or `https://datakit-operator.datakit.svc:443`, used to query CRD configuration; leave empty to skip pulling. The `https://` prefix is added automatically if missing. |
| `LOGFWD_GLOBAL_SOURCE` | Global `source`; takes priority over the `source` field of individual configurations. |
| `LOGFWD_GLOBAL_SERVICE` | Global `service`; used when an individual configuration does not specify one. If the global value is also empty, falls back to `source`. |
| `LOGFWD_GLOBAL_STORAGE_INDEX` | Global `storage_index`; takes priority over the `storage_index` field of individual configurations. |
| `LOGFWD_POD_NAME` | Automatically written as the `pod_name` tag, usually injected via the Downward API. |
| `LOGFWD_POD_NAMESPACE` | Automatically written as the `namespace` tag. |
| `LOGFWD_POD_IP` | Automatically written as the `pod_ip` tag for locating container instances. |
Tip: If you need to attach more tags, mount the `/etc/podinfo/labels` file in the Pod (added automatically when DataKit-Operator injects the logfwd sidecar); logfwd will parse it and align it with `podTargetLabels` in the CRD.
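The precedence rules for `source`, `service`, and `storage_index` in the table above can be sketched in a few lines. This is Python for illustration only; `resolve_fields` is our name, not a logfwd function:

```python
import os

def resolve_fields(cfg: dict) -> dict:
    """Resolve the effective source/service/storage_index for one collection
    config, following the documented precedence (illustrative sketch)."""
    # global source wins over the per-config "source"; "default" is the last resort
    source = os.environ.get("LOGFWD_GLOBAL_SOURCE") or cfg.get("source") or "default"
    # per-config service first, then the global value, then fall back to source
    service = cfg.get("service") or os.environ.get("LOGFWD_GLOBAL_SERVICE") or source
    # global storage_index wins over the per-config field
    storage_index = os.environ.get("LOGFWD_GLOBAL_STORAGE_INDEX") or cfg.get("storage_index", "")
    return {"source": source, "service": service, "storage_index": storage_index}

print(resolve_fields({"source": "nginx-access", "storage_index": "app-logs"}))
```

With no global environment variables set, this yields `nginx-access` for both `source` and `service`, since an empty `service` falls back to `source`.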
Collection Configuration¶
logfwd supports two configuration methods, in order of priority from high to low:
- Manual configuration (`LOGFWD_LOG_CONFIGS`): pass a JSON string through the environment variable; the structure is basically the same as the old `loggings` sub-items. When a manual configuration exists, logfwd immediately creates tailers and keeps that configuration for the lifetime of the process; after deleting the variable or clearing its content, the container must be restarted to release them.
- DataKit-Operator CRD: when `LOGFWD_DATAKIT_OPERATOR_ENDPOINT` is set, logfwd calls the DataKit-Operator API once per minute and decides whether a hot update is needed by checking the MD5 of the configuration content. After a configuration change, tailers are recreated automatically without restarting the container.
Note: If both manual configuration and CRD configuration exist and point to the same log path, duplicate collection will occur. It is recommended to prioritize CRD configuration, and manual configuration is only for debugging or special scenarios.
The LOGFWD_LOG_CONFIGS field structure example is as follows:
```json
[
  {
    "type": "file",
    "disable": false,
    "source": "nginx-access",
    "service": "nginx",
    "path": "/var/log/nginx/access.log",
    "pipeline": "nginx-access.p",
    "storage_index": "app-logs",
    "multiline_match": "^\\d{4}-\\d{2}-\\d{2}",
    "remove_ansi_escape_codes": false,
    "from_beginning": false,
    "character_encoding": "utf-8",
    "tags": {
      "env": "production",
      "team": "backend"
    }
  }
]
```
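A JSON array like the one above could be parsed and minimally validated like this (`type` and `source` are required, `path` is required when `type=file`, and `disable` skips an entry). A sketch only, not logfwd's actual parser:

```python
import json

def parse_log_configs(raw: str) -> list:
    """Parse a LOGFWD_LOG_CONFIGS-style JSON array, check required fields,
    and return only the enabled entries (illustrative sketch)."""
    active = []
    for cfg in json.loads(raw):
        if cfg.get("type") != "file":
            raise ValueError("type is required and can only be 'file'")
        if not cfg.get("source"):
            raise ValueError("source is required")
        if not cfg.get("path"):
            raise ValueError("path is required when type=file")
        if not cfg.get("disable", False):  # skip disabled entries
            active.append(cfg)
    return active
```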
| Field | Type | Required | Description | Example |
|---|---|---|---|---|
| `type` | string | Yes | Collection type; can only be `"file"` | `"file"` |
| `disable` | boolean | No | Whether to disable this collection configuration | `false` |
| `source` | string | Yes | Log source identifier, used to distinguish different log streams | `"nginx-access"` |
| `service` | string | No | Service the log belongs to; defaults to the log source (`source`) | `"nginx"` |
| `path` | string | Conditionally required | Log file path (supports glob patterns); required when `type=file` | `"/var/log/nginx/*.log"` |
| `multiline_match` | string | No | Regular expression matching the first line of a multi-line log; note that backslashes must be escaped in JSON | `"^\\d{4}-\\d{2}-\\d{2}"` |
| `pipeline` | string | No | Pipeline script file name for log parsing (configured on the DataKit side) | `"nginx-access.p"` |
| `storage_index` | string | No | Log storage index name | `"app-logs"` |
| `remove_ansi_escape_codes` | boolean | No | Whether to strip ANSI escape sequences (color codes, etc.) from log data | `false` |
| `from_beginning` | boolean | No | Whether to collect from the beginning of the file (default: start from the end) | `false` |
| `from_beginning_threshold_size` | int | No | When a file is discovered and its size is below this value (in bytes), read from the beginning; default 20MB | `1000` |
| `character_encoding` | string | No | Character encoding: `utf-8`, `utf-16le`, `utf-16be`, `gbk`, `gb18030`, or an empty string (auto-detect); default empty | `"utf-8"` |
| `tags` | object | No | Additional tag key-value pairs attached to each log record | `{"env": "prod"}` |
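How `multiline_match` stitches continuation lines onto the entry that started them can be shown with a small sketch (`group_multiline` is our illustrative helper, not logfwd code):

```python
import re

def group_multiline(lines, multiline_match):
    """Group raw lines into log entries: a line matching the start-line
    regex begins a new entry; non-matching lines are appended to the
    previous entry (a sketch of multiline_match semantics)."""
    start = re.compile(multiline_match)
    entries = []
    for line in lines:
        if start.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += "\n" + line
    return entries

logs = [
    "2024-05-01 12:00:00 ERROR boom",
    "Traceback (most recent call last):",
    '  File "app.py", line 1',
    "2024-05-01 12:00:01 INFO ok",
]
print(group_multiline(logs, r"^\d{4}-\d{2}-\d{2}"))
```

Here the traceback lines do not match `^\d{4}-\d{2}-\d{2}`, so they are folded into the first entry, and the four raw lines become two log records.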
When LOGFWD_DATAKIT_OPERATOR_ENDPOINT is configured, logfwd will make requests to DataKit-Operator based on LOGFWD_POD_NAMESPACE, LOGFWD_POD_NAME, and pod_labels (optional, requires mounting /etc/podinfo/labels file). As long as a ClusterLoggingConfig CRD rule matches the current Pod, the corresponding configs JSON will be returned and hot update will be triggered.
CRD Configuration Example:
```yaml
apiVersion: logging.datakits.io/v1alpha1
kind: ClusterLoggingConfig
metadata:
  name: nginx-logs
spec:
  selector:
    namespaceRegex: "^(default|production)$"
    podRegex: "^(nginx-.*)$"
    podLabelSelector: "app=nginx,env=production"
    containerRegex: "^(nginx|app)$"
  podTargetLabels:
    - app
    - version
    - team
  configs:
    - type: "file"
      source: "nginx-access"
      path: "/var/log/nginx/access.log"
      pipeline: "nginx-access.p"
      storage_index: "app-logs"
      tags:
        log_type: "access"
        component: "nginx"
    - type: "file"
      source: "nginx-error"
      path: "/var/log/nginx/error.log"
      pipeline: "nginx-error.p"
      storage_index: "app-logs"
      tags:
        log_type: "error"
        component: "nginx"
```
CRD Selector Description:
| Field | Type | Required | Description | Example |
|---|---|---|---|---|
| `namespaceRegex` | string | No | Regex match on the namespace name (all conditions are ANDed) | `"^(default\|production)$"` |
| `podRegex` | string | No | Regex match on the Pod name | `"^(nginx-.*)$"` |
| `podLabelSelector` | string | No | Pod label selector (comma-separated `key=value` pairs) | `"app=nginx,environment=production"` |
| `containerRegex` | string | No | Regex match on the container name | `"^(nginx\|app-container)$"` |
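The AND semantics of these selector fields can be sketched as follows (`pod_matches` is an illustrative helper; the real matching happens inside DataKit-Operator):

```python
import re

def pod_matches(selector: dict, namespace: str, pod_name: str, pod_labels: dict) -> bool:
    """All specified selector conditions are ANDed; an unspecified
    condition matches everything (sketch of the documented rules)."""
    if selector.get("namespaceRegex") and not re.search(selector["namespaceRegex"], namespace):
        return False
    if selector.get("podRegex") and not re.search(selector["podRegex"], pod_name):
        return False
    label_sel = selector.get("podLabelSelector", "")
    if label_sel:
        # comma-separated key=value pairs, each of which must match exactly
        for pair in label_sel.split(","):
            key, _, value = pair.partition("=")
            if pod_labels.get(key.strip()) != value.strip():
                return False
    return True
```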
`podTargetLabels`: specifies the list of label keys to extract from the Pod's Labels and attach to logs. logfwd reads the `/etc/podinfo/labels` file (injected by the Downward API or DataKit-Operator), extracts the matching labels, and adds them to the log tags.
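The Downward API writes the labels file as one `key="value"` pair per line; extracting the `podTargetLabels` subset from it can be sketched like this (helper names are ours):

```python
def parse_podinfo_labels(text: str) -> dict:
    """Parse Downward API /etc/podinfo/labels content, where each line
    has the form key="value", into a plain dict (illustrative sketch)."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        labels[key] = value.strip('"')
    return labels

def extract_target_labels(labels: dict, pod_target_labels: list) -> dict:
    """Keep only the keys listed in podTargetLabels for attaching as tags."""
    return {k: labels[k] for k in pod_target_labels if k in labels}

content = 'app="nginx"\nversion="v1.0"'
print(extract_target_labels(parse_podinfo_labels(content), ["app", "team"]))
```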
Configuration Hot Update Mechanism:
- logfwd polls the DataKit-Operator API once per minute
- Determines if there are changes by calculating the MD5 value of the configuration content
- After configuration changes, automatically stops old tailers and creates new tailers without restarting the container
- Configuration changes usually take effect within 1 minute
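The MD5-based change detection in the steps above amounts to comparing digests of the raw configuration payload, roughly:

```python
import hashlib

def config_fingerprint(raw: str) -> str:
    """MD5 digest of the raw configuration payload; a changed digest
    means tailers must be rebuilt (sketch of the change-detection step,
    not logfwd's actual code)."""
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

def maybe_reload(last_md5: str, raw: str) -> str:
    """Return the new digest; a caller would recreate tailers when it
    differs from the previous one."""
    new_md5 = config_fingerprint(raw)
    if new_md5 != last_md5:
        # stop old tailers and create new ones from the fresh config here
        pass
    return new_md5
```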
Note
- Log directories must be shared in advance via `volumes`/`volumeMounts` in the business Pod/sidecar (for example an `emptyDir`); otherwise logfwd cannot access the log files.
- `LOGFWD_LOG_CONFIGS` and CRD configurations are independent of each other. If both point to the same path, logs will be collected twice.
- DataKit-Operator supports automatically injecting the logfwd sidecar and mounts into target Pods. For details, please refer to the DataKit-Operator documentation.
ClusterLoggingConfig CRD Selector Support¶
When logfwd queries the ClusterLoggingConfig CRD through DataKit-Operator, it supports the following selector fields to match target CRDs and log collection configurations:
| Selector Field | Description | Example |
|---|---|---|
| `namespaceRegex` | Regex match on the namespace name, using the logfwd container's `LOGFWD_POD_NAMESPACE` environment variable as the query parameter | `"^(default)$"` |
| `podNameRegex` | Regex match on the Pod name, using the logfwd container's `LOGFWD_POD_NAME` environment variable as the query parameter | `"^(nginx-app.*)$"` |
| `podLabelSelector` | Pod label selector (prerequisite: the `/etc/podinfo/labels` file in the logfwd container has label content) | `"app=nginx,environment=production"` |
Note
- logfwd does not support the `containerRegex` selector. Since logfwd runs as a Pod sidecar, it only collects log files and cannot distinguish container names.
- `podLabelSelector` depends on the existence of the `/etc/podinfo/labels` file. DataKit-Operator automatically mounts this file (via the Downward API) when injecting the logfwd sidecar. If the file does not exist or is empty, `podLabelSelector` will not take effect.
- All selector conditions are ANDed: every specified selector must match for a Pod to be selected.
Example: Kubernetes Pod Configuration¶
- Using DataKit-Operator CRD Configuration
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-app
  namespace: default
  labels:
    app: nginx
    version: v1.0
spec:
  containers:
    - name: nginx
      image: nginx:latest
      volumeMounts:
        - name: nginx-logs
          mountPath: /var/log/nginx
    - name: logfwd
      image: pubrepo.guance.com/datakit/logfwd:1.87.2
      env:
        - name: LOGFWD_LOG_LEVEL
          value: "info" # optional: set to debug for detailed logs
        - name: LOGFWD_DATAKIT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: LOGFWD_DATAKIT_PORT
          value: "9533"
        - name: LOGFWD_DATAKIT_OPERATOR_ENDPOINT
          value: datakit-operator.datakit.svc:443
        - name: LOGFWD_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: LOGFWD_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LOGFWD_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
          readOnly: true
        - name: nginx-logs
          mountPath: /var/log/nginx
          readOnly: true
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: "labels"
            fieldRef:
              fieldPath: metadata.labels
    - name: nginx-logs
      emptyDir: {}
```
Corresponding ClusterLoggingConfig CRD configuration:
```yaml
apiVersion: logging.datakits.io/v1alpha1
kind: ClusterLoggingConfig
metadata:
  name: nginx-logs
spec:
  selector:
    namespaceRegex: "^default$"
    podLabelSelector: "app=nginx"
  podTargetLabels:
    - app
    - version
  configs:
    - type: "file"
      source: "nginx-access"
      path: "/var/log/nginx/access.log"
      pipeline: "nginx-access.p"
    - type: "file"
      source: "nginx-error"
      path: "/var/log/nginx/error.log"
      pipeline: "nginx-error.p"
```
- Using Manual Configuration
If you need manual configuration temporarily, or for debugging, add the LOGFWD_LOG_CONFIGS environment variable:
```yaml
spec:
  containers:
    - name: logfwd
      image: pubrepo.guance.com/datakit/logfwd:1.87.2
      env:
        - name: LOGFWD_DATAKIT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: LOGFWD_DATAKIT_PORT
          value: "9533"
        - name: LOGFWD_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: LOGFWD_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LOGFWD_LOG_CONFIGS
          value: |
            [
              {
                "type": "file",
                "source": "app-logs",
                "path": "/var/log/app/*.log",
                "pipeline": "app.p",
                "from_beginning": false,
                "tags": {
                  "env": "production"
                }
              }
            ]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true
  volumes:
    - name: app-logs
      emptyDir: {}
```
The mounting patterns, volumes/volumeMounts syntax, resource limits, etc. remain consistent with versions before 1.86.0, and you can continue to refer to the old version examples in the next section.
logfwd Usage and Configuration (Before 1.86.0)¶
The logfwd main configuration is in JSON format, and the following is a configuration example:
```json
[
  {
    "datakit_addr": "127.0.0.1:9533",
    "loggings": [
      {
        "logfiles": ["<your-logfile-path>"],
        "ignore": [],
        "storage_index": "<your-storage-index>",
        "source": "<your-source>",
        "service": "<your-service>",
        "pipeline": "<your-pipeline.p>",
        "character_encoding": "",
        "multiline_match": "<your-match>",
        "tags": {}
      },
      {
        "logfiles": ["<your-logfile-path-2>"],
        "source": "<your-source-2>"
      }
    ]
  }
]
```
Description of configuration parameters:
- `datakit_addr` is the DataKit logfwdserver address; it is usually configured with the environment variables `LOGFWD_DATAKIT_HOST` and `LOGFWD_DATAKIT_PORT`.
- `loggings` is the primary configuration, an array whose sub-items are basically the same as those of the logging collector:
    - `logfiles`: list of log files. You can specify absolute paths and batch-specify files using glob rules; absolute paths are recommended.
    - `ignore`: file path filters using glob rules; a file is not collected if it matches any filter.
    - `storage_index`: sets the storage index.
    - `source`: data source; if empty, `default` is used.
    - `service`: adds a tag; if empty, the value of `source` is used.
    - `pipeline`: Pipeline script path; if empty, `<source>.p` is used, and if `<source>.p` does not exist, no Pipeline is applied (the script file lives on the DataKit side).
    - `character_encoding`: character encoding; if the wrong encoding is chosen, the data may be unreadable. Empty by default. Supports `utf-8`, `utf-16le`, `utf-16be`, `gbk`, `gb18030`, or `""`.
    - `multiline_match`: multi-line match, as in the logging collector configuration. Because this configuration is JSON, the triple-single-quote "no escape" notation is not supported, and a regex like `^\d{4}` must be escaped as `^\\d{4}`.
    - `tags`: additional tags written as a JSON map, e.g. `{ "key1": "value1", "key2": "value2" }`.
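The `logfiles`/`ignore` glob semantics amount to "collected if matched by any `logfiles` pattern and by no `ignore` pattern". A sketch using Python's `fnmatch` as a stand-in for logfwd's glob implementation:

```python
import fnmatch

def should_collect(path: str, logfiles: list, ignore: list) -> bool:
    """A file is collected when it matches at least one logfiles pattern
    and no ignore pattern (sketch of the documented glob semantics)."""
    if not any(fnmatch.fnmatch(path, g) for g in logfiles):
        return False
    return not any(fnmatch.fnmatch(path, g) for g in ignore)

print(should_collect("/var/log/nginx/access.log",
                     ["/var/log/nginx/*.log"], ["/var/log/nginx/error*"]))
```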
Supported environment variables:
| Environment Variable Name | Configuration Item Meaning |
|---|---|
| `LOGFWD_DATAKIT_HOST` | DataKit address |
| `LOGFWD_DATAKIT_PORT` | DataKit port |
| `LOGFWD_GLOBAL_SOURCE` | Configures the global `source`, with the highest priority |
| `LOGFWD_GLOBAL_STORAGE_INDEX` | Configures the global `storage_index`, with the highest priority |
| `LOGFWD_GLOBAL_SERVICE` | Configures the global `service`, with the highest priority |
| `LOGFWD_POD_NAME` | Specifies the Pod name; adds `pod_name` to tags |
| `LOGFWD_POD_NAMESPACE` | Specifies the Pod namespace; adds `namespace` to tags |
| `LOGFWD_ANNOTATION_DATAKIT_LOGS` | Uses the current Pod's `datakit/logs` annotation configuration, with higher priority than the logfwd JSON configuration |
| `LOGFWD_JSON_CONFIG` | The logfwd main configuration, i.e. the JSON-format text above |
Installation and Running¶
The deployment of logfwd in Kubernetes consists of two parts. The first is the Pod configuration that creates spec.containers, including injecting environment variables and mounting directories:
```yaml
spec:
  containers:
    - name: logfwd
      env:
        - name: LOGFWD_DATAKIT_HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: LOGFWD_DATAKIT_PORT
          value: "9533"
        - name: LOGFWD_ANNOTATION_DATAKIT_LOGS
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations['datakit/logs']
        - name: LOGFWD_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: LOGFWD_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: LOGFWD_GLOBAL_SOURCE
          value: nginx-source-test
      image: pubrepo.guance.com/datakit/logfwd:1.87.2
      imagePullPolicy: Always
      resources:
        requests:
          cpu: "200m"
          memory: "128Mi"
        limits:
          cpu: "1000m"
          memory: "2Gi"
      volumeMounts:
        - mountPath: /opt/logfwd/config
          name: logfwd-config-volume
          subPath: config
      workingDir: /opt/logfwd
  volumes:
    - configMap:
        name: logfwd-config
      name: logfwd-config-volume
```
The second part is the configuration that logfwd actually runs with: the JSON-format main configuration mentioned earlier, stored in Kubernetes as a ConfigMap (its name must match the configMap referenced in the volumes section). Adapt the config from the logfwd configuration example. The ConfigMap format is as follows:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logfwd-config
data:
  config: |
    [
      {
        "loggings": [
          {
            "logfiles": ["/var/log/1.log"],
            "source": "log_source",
            "tags": {}
          },
          {
            "logfiles": ["/var/log/2.log"],
            "source": "log_source2"
          }
        ]
      }
    ]
```
Integrate the two configurations into your existing Kubernetes YAML, using volumes and volumeMounts to share directories between containers, so that the logfwd container can collect the log files of the other containers.
Note that you need to use `volumes` and `volumeMounts` to mount and share the log directory of the application container (the `count` container in the example below) so that it is accessible in the logfwd container. See the volumes doc.
The complete example is as follows:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: logfwd
spec:
  containers:
    - name: count
      image: busybox
      args:
        - /bin/sh
        - -c
        - >
          i=0;
          while true;
          do
            echo "$i: $(date)" >> /var/log/1.log;
            echo "$(date) INFO $i" >> /var/log/2.log;
            i=$((i+1));
            sleep 1;
          done
      volumeMounts:
        - name: varlog
          mountPath: /var/log
    - name: logfwd
      env:
        - name: LOGFWD_DATAKIT_HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: LOGFWD_DATAKIT_PORT
          value: "9533"
        - name: LOGFWD_ANNOTATION_DATAKIT_LOGS
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations['datakit/logs']
        - name: LOGFWD_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: LOGFWD_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
      image: pubrepo.guance.com/datakit/logfwd:1.87.2
      imagePullPolicy: Always
      resources:
        requests:
          cpu: "200m"
          memory: "128Mi"
        limits:
          cpu: "1000m"
          memory: "2Gi"
      volumeMounts:
        - name: varlog
          mountPath: /var/log
        - mountPath: /opt/logfwd/config
          name: logfwd-config-volume
          subPath: config
      workingDir: /opt/logfwd
  volumes:
    - name: varlog
      emptyDir: {}
    - configMap:
        name: logfwd-config
      name: logfwd-config-volume
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logfwd-config
data:
  config: |
    [
      {
        "loggings": [
          {
            "logfiles": ["/var/log/1.log"],
            "source": "log_source",
            "tags": {
              "flag": "tag1"
            }
          },
          {
            "logfiles": ["/var/log/2.log"],
            "source": "log_source2"
          }
        ]
      }
    ]
```
Performance Test¶
- Environment:
    - The log file contains 10 million nginx log lines, file size 2.2 GB:

```
192.168.17.1 - - [06/Jan/2022:16:16:37 +0000] "GET /google/company?test=var1%20Pl HTTP/1.1" 401 612 "http://www.google.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"
```

- Results:
    - Reading and forwarding all logs took 95 seconds, an average of about 100,000 logs per second.
    - Peak single-core CPU utilization was 42%; the `top` record at that time:
```
top - 16:32:46 up 52 days,  7:28, 17 users,  load average: 2.53, 0.96, 0.59
Tasks: 464 total,   2 running, 457 sleeping,   0 stopped,   5 zombie
%Cpu(s): 30.3 us, 33.7 sy,  0.0 ni, 34.3 id,  0.1 wa,  0.0 hi,  1.5 si,  0.0 st
MiB Mem :  15885.2 total,    985.2 free,   6204.0 used,   8696.1 buff/cache
MiB Swap:   2048.0 total,      0.0 free,   2048.0 used.   8793.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1850829 root      20   0  715416  17500   8964 R  42.1   0.1   0:10.44 logfwd
```
More Readings¶
- DataKit summary of log collection
- Socket Log access best practices
- Log collection configuration for specifying pod in Kubernetes
- Third-party log access
- Introduction of DataKit configuration mode in Kubernetes environment
- Install DataKit as DaemonSet
- Deploy `logfwdserver` on DataKit
- Proper use of regular expressions to configure