DataKit Main Configuration¶

The DataKit main configuration controls how DataKit itself runs.

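Host deployment

The main configuration file is usually located at:

Linux/Mac: /usr/local/datakit/conf.d/datakit.conf
Windows: C:\Program Files\datakit\conf.d\datakit.conf

Kubernetes

With a DaemonSet installation, this file also exists in the corresponding directory, but DataKit does not actually load it. The configuration is instead generated from environment variables injected in datakit.yaml. Every setting described below has a corresponding environment variable, which can be found in the Kubernetes deployment documentation.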

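DataKit Main Configuration Example¶

A sample DataKit main configuration is shown below; the various features can be enabled based on it (current version 1.76.1):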

datakit.conf

################################################
# Global configures
################################################
# Default enabled input list.
default_enabled_inputs = [
  "cpu",
  "disk",
  "diskio",
  "host_processes",
  "hostobject",
  "mem",
  "net",
  "swap",
  "system",
]

# enable_pprof: bool
# If pprof is enabled, we can profile the running Datakit
enable_pprof = true
pprof_listen = "localhost:6060" # pprof listen

# protect_mode: bool, default false
# When protect_mode is enabled, we can set radical collect parameters; these may cause Datakit
# to collect data more frequently.
protect_mode = true

# The user name running datakit. Generally for audit purpose. Default is root.
datakit_user = "root"

################################################
# ulimit: set max open-files limit(Linux only)
################################################
ulimit = 64000

################################################
# point_pool: use point pool for better memory usage
################################################
[point_pool]
  enable = false
  reserved_capacity = 4096

################################################
# DCA configure
################################################
[dca]
  # Enable or disable DCA
  enable = false

  # DCA websocket server address
  websocket_server = "ws://localhost:8000/ws"

################################################
# Upgrader 
################################################
[dk_upgrader]
  # host address
  host = "0.0.0.0"

  # port number
  port = 9542 

################################################
# Pipeline
################################################
[pipeline]
  # IP database type, support iploc and geolite2
  ipdb_type = "iploc"

  # How often to sync remote pipeline
  remote_pull_interval = "1m"

  #
  # reftab configures
  #
  # Reftab remote HTTP URL(https/http)
  refer_table_url = ""

  # How often reftab sync the remote
  refer_table_pull_interval = "5m"

  # use sqlite to store reftab data to release memory usage
  use_sqlite = false
  # or use pure memory to cache the reftab data
  sqlite_mem_mode = false

  # append run info
  disable_append_run_info = false

  # default pipeline
  [pipeline.default_pipeline]
    # logging = "<your_script.p>"
    # metric  = "<your_script.p>"
    # tracing = "<your_script.p>"

  # Offload data processing tasks to post-level data processors.
  [pipeline.offload]
    receiver = "datakit-http"
    addresses = [
      # "http://<ip>:<port>"
    ]

################################################
# HTTP server(9529)
################################################
[http_api]

  # HTTP server address
  listen = "localhost:9529"

  # Disable 404 page to hide detailed Datakit info
  disable_404page = false

  # only enable these APIs. If list empty, all APIs are enabled.
  public_apis = []

  # Datakit server-side timeout
  timeout = "30s"
  close_idle_connection = false

  # API rate limit(QPS)
  request_rate_limit = 20.0

  #
  # RUM related: we should port these configures to RUM inputs(TODO)
  #
  # When serving RUM(/v1/write/rum), extract the IP address from this HTTP header
  rum_origin_ip_header = "X-Forwarded-For"
  # When serving RUM(/v1/write/rum), only accept requests from these app-id.
  # If the list empty, all app's requests accepted.
  rum_app_id_white_list = []

  # only these domains enable CORS. If list empty, all domains are enabled.
  allowed_cors_origins = []

  # Start Datakit web server with HTTPS
  [http_api.tls]
    # cert = "path/to/certificate/file"
    # privkey = "path/to/private_key/file"

################################################
# io configures
################################################
[io]
  # How often Datakit flush data to dataway.
  # Datakit will upload data points if cached(in memory) points
  #  reached(>=) the max_cache_count or the flush_interval triggered.
  max_cache_count = 1000
  flush_workers   = 0 # default to (cpu_core * 2)
  flush_interval  = "10s"

  # Queue size of feed.
  feed_chan_size = 1

  # Set blocking if queue is full.
  # NOTE: Global blocking mode may consume more memory on large metric points.
  global_blocking = false

  # Data point filter configures.
  # NOTE: Most of the time, you should use web-side filter, it's a debug helper for developers.
  #[io.filters]
  #  logging = [
  #   "{ source = 'datakit' or f1 IN [ 1, 2, 3] }"
  #  ]
  #  metric = [
  #    "{ measurement IN ['datakit', 'disk'] }",
  #    "{ measurement CONTAIN ['host.*', 'swap'] }",
  #  ]
  #  object = [
  #    "{ class CONTAIN ['host_.*'] }",
  #  ]
  #  tracing = [
  #    "{ service = re("abc.*") AND some_tag CONTAIN ['def_.*'] }",
  #  ]

[recorder]
  enabled = false
  #path = "/path/to/point-data/dir"
  encoding = "v2"  # use protobuf-json format
  duration = "30m" # record for 30 minutes

  # only record these inputs, if empty, record all
  inputs = [
    #"cpu",
    #"mem",
  ]

  # only record these categories, if empty, record all
  category = [
    #"logging",
    #"object",
  ]

################################################
# Dataway configure
################################################
[dataway]
  # urls: Dataway URL list
  # NOTE: do not configure multiple URLs here, it's a deprecated feature.
  urls = [
    # "https://openway.guance.com?token=<YOUR-WORKSPACE-TOKEN>"
  ]

  # Dataway HTTP timeout
  timeout_v2 = "30s"

  # max_retry_count specifies at most how many times will be tried when dataway API fails(not 4xx),
  # default value(and minimal) is 1 and maximum value is 10.
  #
  # The default is set to 1 so that a failing API call fails ASAP and releases memory.
  max_retry_count = 1

  # The interval between two retry operation, valid time units are "ns", "us", "ms", "s", "m", "h"
  retry_delay = "1s"

  # HTTP Proxy
  # Format: "http(s)://IP:Port"
  http_proxy = ""

  max_idle_conns   = 0       # limit idle TCP connections for HTTP request to Dataway
  enable_httptrace = false   # enable trace HTTP metrics(connection/DNS/TLS and so on)
  idle_timeout     = "90s"   # not-set, default 90s

  # HTTP body content type, other candidates are(case insensitive):
  #  - v1: line-protocol
  #  - v2: protobuf
  content_encoding = "v2"

  # Enable GZip to upload point data.
  #
  # do NOT disable gzip or you will get large network payloads.
  gzip = true

  max_raw_body_size = 1048576 # max body size(before gzip) in bytes

  # Customer tag or field keys that will extract from exist points
  # to build the X-Global-Tags HTTP header value.
  global_customer_keys = []
  enable_sinker        = false # disable sinker

  # use dataway as NTP server
  [dataway.ntp]
    enable   = true
    interval = "5m"  # sync dataway time each 5min(minimal 1min)

    # if abs(datakit time - dataway time) >= diff, datakit will adjust data point
    # time with dataway time.
    diff     = "30s"  # minimal 5s

  # WAL queue for uploading points
  [dataway.wal]
    max_capacity_gb = 2.0 # 2GB reserved disk space for each category(M/L/O/T/...)
    #workers = 4          # flush workers on WAL(default to CPU limited cores)
    #mem_cap = 4          # in-memory queue capacity(default to CPU limited cores)
    #fail_cache_clean_interval = "30s" # duration for clean fail uploaded data
    #no_drop_categories = ["L"]        # category list that disable drop data when disk cache full


################################################
# Datakit logging configure
################################################
[logging]

  # log path
  log = "/var/log/datakit/log"

  # HTTP access log
  gin_log = "/var/log/datakit/gin.log"

  # log level(info/debug)
  level = "info"

  # Disable log color
  disable_color = false

  # log rotate size(in MB)
  # DataKit will always keep at most n+1(n backup logs and 1 log being written) split log files on disk.
  rotate = 32

  # Upper limit count of backup log
  rotate_backups = 5

################################################
# Global tags
################################################
# We will try to add these tags to every collected data point if these
# tags do not exist in the original data.
#
# NOTE: we can get the real IP of the current node, we just need
# to set "$datakit_ip" or "__datakit_ip" here. Same for the hostname.
[global_host_tags]
  ip   = "$datakit_ip"
  host = "$datakit_hostname"

[election]
  # Enable election
  enable = false

  # Election whitelist
  # NOTE: Empty to disable whitelist
  node_whitelist = []

  # Election namespace.
  # NOTE: for single workspace, there can be multiple election namespace.
  namespace = "default"

  # If enabled, every data point will add a tag with election_namespace = <your-election-namespace>
  enable_namespace_tag = false

  # Like global_host_tags, but only for data points that are remotely collected(such as MySQL/Nginx).
  [election.tags]
    #  project = "my-project"
    #  cluster = "my-cluster"

###################################################
# Tricky: we can rename the default hostname here
###################################################
[environments]
  ENV_HOSTNAME = ""

################################################
# resource limit configures
################################################
[resource_limit]

  # enable or disable resource limit
  enable = true

  # Linux only, cgroup path
  path = "/datakit"

  # Limit CPU cores
  cpu_cores = 2.0

  # set max memory usage(MB)
  mem_max_mb = 4096

################################################
# git_repos configures
################################################

# We can host all input configures on a git server
[git_repos]
  # git pull interval
  pull_interval = "1m"

  # git repository settings
  [[git_repos.repo]]
    # enable the repository or not
    enable = false

    # the branch name to pull
    branch = "master"

    # git repository URL. There are 3 formats here:
    #   - HTTP(s): such as "https://github.datakit.com/path/to/datakit-conf.git"
    #   - Git: such as "git@github.com:path/to/datakit.git"
    #   - SSH: such as "ssh://git@github.com:9000/path/to/repository.git"
    url = ""

    # For formats Git and SSH, we need extra configures:
    ssh_private_key_path = ""
    ssh_private_key_password = ""

################################################
# crypto key or key filePath.
################################################
[crypto]
  aes_key = ""
  aes_Key_file = ""

[remote_job]
  enable=false
  envs = ["OSS_BUCKET_HOST=host","OSS_ACCESS_KEY_ID=key","OSS_ACCESS_KEY_SECRET=secret","OSS_BUCKET_NAME=bucket"]
  interval = "30s"
  java_home=""

HTTP Service Configuration¶

DataKit starts an HTTP service to receive external data and to provide basic data services externally.


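Changing the HTTP Service Address¶

The default HTTP service address is localhost:9529. If port 9529 is already in use, or if DataKit's HTTP service must be reachable from outside (for example to receive RUM or Tracing data), change it to: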

[http_api]
   listen = "0.0.0.0:<other-port>"
   # or use an IPv6 address
   # listen = "[::]:<other-port>"

Note: IPv6 support requires DataKit 1.5.7 or later.

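Using a Unix Domain Socket¶

DataKit supports access through UNIX domain sockets. To enable it, set the listen field to the full path of a file that does not yet exist. datakit.sock is used as an example here; any file name works.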

[http_api]
   listen = "/tmp/datakit.sock"

After configuring, test the setup with curl: sudo curl --no-buffer -XGET --unix-socket /tmp/datakit.sock http://localhost/v1/ping. More information about curl test commands is available here.
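The same check can be made from a program. The sketch below is a minimal Go client, assuming the socket path /tmp/datakit.sock from the example above; newUnixSocketClient is a hypothetical helper name, not part of DataKit. It overrides the transport's dial function so that every request goes through the socket, whatever host the URL names.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// newUnixSocketClient returns an HTTP client whose connections all go through
// the given Unix domain socket. The host part of the request URL is ignored
// for dialing. (Hypothetical helper for illustration.)
func newUnixSocketClient(socketPath string) *http.Client {
	return &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

func main() {
	client := newUnixSocketClient("/tmp/datakit.sock")

	resp, err := client.Get("http://localhost/v1/ping")
	if err != nil {
		fmt.Println("ping failed:", err)
		return
	}
	defer resp.Body.Close()

	fmt.Println("status:", resp.Status)
}
```

If DataKit is not listening on that socket, the program prints the dial error instead of a status line.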

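HTTP Request Rate Limiting¶

Enabled by default since Version-1.62.0.

Because DataKit has to accept large volumes of externally written data, it applies a default QPS limit of 20 requests per second to its APIs, so as not to put excessive load on the node it runs on: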

[http_api]
  request_rate_limit = 20.0 # QPS limit per client (IP + API route)

  # If there really is a large volume of data to write, raise the limit as appropriate to avoid data loss (clients exceeding the limit receive HTTP 429)
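
To make the "per client (IP + API route)" scope concrete, here is a minimal Go sketch of a limiter keyed that way. It is an illustration only: the fixed one-second window and the names limiter and status are assumptions, and DataKit's actual limiting algorithm may differ.

```go
package main

import (
	"fmt"
	"net/http"
)

// slot holds the request count of one client key within one wall-clock second.
type slot struct {
	second int64
	count  int
}

// limiter is a fixed-window counter keyed by client IP + API route.
// (Illustrative only; not DataKit's implementation.)
type limiter struct {
	limit int
	slots map[string]slot
}

func newLimiter(limit int) *limiter {
	return &limiter{limit: limit, slots: make(map[string]slot)}
}

// status returns the HTTP status a request would receive at Unix second nowSec.
func (l *limiter) status(clientIP, route string, nowSec int64) int {
	key := clientIP + " " + route
	s := l.slots[key]
	if s.second != nowSec {
		s = slot{second: nowSec} // a new second starts a new window
	}
	if s.count >= l.limit {
		return http.StatusTooManyRequests // 429
	}
	s.count++
	l.slots[key] = s
	return http.StatusOK
}

func main() {
	l := newLimiter(20)
	rejected := 0
	for i := 0; i < 25; i++ {
		if l.status("10.0.0.1", "/v1/write/metric", 100) == http.StatusTooManyRequests {
			rejected++
		}
	}
	fmt.Println("rejected:", rejected) // rejected: 5
}
```

A different client IP, or the same IP on a different route, gets its own counter, so one busy writer does not consume the quota of another.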

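Other Settings¶

[http_api]
    close_idle_connection = true # close idle connections
    timeout = "30s"              # server-side HTTP timeout

For Kubernetes, see the corresponding settings in the Kubernetes deployment documentation.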

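HTTP API Access Control¶

Version-1.64.0

For security reasons, DataKit restricts access to some of its own APIs by default (these APIs can only be accessed through localhost). If DataKit is deployed in a public-network environment and these APIs must be requested from other machines or from the public network, modify the public_apis field in datakit.conf: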

[http_api]
  public_apis = [
    # expose DataKit's own metrics endpoint /metrics
    "/metrics",
    # ... # other APIs
  ]

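By default, public_apis is empty. For convenience and compatibility, only some APIs are open by default, and all other APIs are blocked from external access. APIs that belong to a collector (for example trace collectors) are opened automatically once the collector is enabled and are then reachable from outside by default.

To add an API whitelist in Kubernetes, see the Kubernetes deployment documentation.

Warning

Once public_apis is non-empty, the APIs that are open by default must be added to it again manually: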

[http_api]
  public_apis = [
    "/v1/write/metric",
    "/v1/write/logging",
    # ...
  ]

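Global Tag Configuration¶

Version-1.4.6

DataKit allows global tags to be configured for all the data it collects. Global tags fall into two categories:

Global host tags (GHT): the collected data is bound to the current host, such as CPU and memory metrics
Global election tags (GET): the collected data comes from a shared (remote) entity, such as MySQL or Redis. These collectors generally take part in election, so their data does not carry tags related to the current host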

[global_host_tags] # the entries here are called "global host tags"
  ip   = "__datakit_ip"
  host = "__datakit_hostname"

[election]
  [election.tags] # the entries here are called "global election tags"
    project = "my-project"
    cluster = "my-cluster"

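Keep the following in mind when adding global tags:

The values of global tags may use the wildcards DataKit currently supports (both the double-underscore (__) prefix and the $ prefix work):
1. __datakit_ip/$datakit_ip: the tag value is set to the IP of the first primary network interface that DataKit detects
2. __datakit_hostname/$datakit_hostname: the tag value is set to DataKit's hostname
Because of restrictions in the DataKit data transfer protocol, a global tag must not use the name of any metric field; otherwise data processing fails due to the protocol violation. See the field list of each collector for details. Also avoid adding too many tags; the key and the value of each tag are both length-limited.
If collected data already carries a tag with the same name, DataKit does not append the global tag configured here.
Even if GHT is left empty, DataKit still adds a host=__datakit_hostname tag to it. hostname is currently the default field for correlating data on the Guance platform, so logs, CPU, memory, and other collected data all carry the host tag.
The two categories of global tags (GHT/GET) may overlap; for example, both can define a project = "my-project" tag.
When election is not enabled, GET inherits all the tags in GHT (which has at least the host tag).
Election collectors append GET by default; non-election collectors append GHT by default.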
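
The "existing tags are not overwritten" rule from the list above can be sketched in a few lines of Go. This is an illustration under assumptions: appendGlobalTags is a hypothetical name, and tags are modeled as plain string maps rather than DataKit's internal point type.

```go
package main

import "fmt"

// appendGlobalTags returns the point's tags plus every global tag whose key
// the point does not already carry. Existing tags always win.
// (Hypothetical helper for illustration.)
func appendGlobalTags(pointTags, globalTags map[string]string) map[string]string {
	out := make(map[string]string, len(pointTags)+len(globalTags))
	for k, v := range pointTags {
		out[k] = v
	}
	for k, v := range globalTags {
		if _, exists := out[k]; !exists {
			out[k] = v
		}
	}
	return out
}

func main() {
	globalHostTags := map[string]string{"host": "datakit-node", "project": "my-project"}

	// The point already has a host tag, so only project is appended.
	point := map[string]string{"host": "real-mysql-host-name"}
	tags := appendGlobalTags(point, globalHostTags)

	fmt.Println(tags["host"], tags["project"]) // real-mysql-host-name my-project
}
```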

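How to tell election collectors from non-election collectors?

At the top of each collector's documentation there are badges like the following, which indicate the collector's platform support and collection features:

[platform and feature badges]

If the election badge is present, the collector is an election collector.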

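Global Tag Settings for Remote Collection¶

DataKit by default appends the tag host=<hostname of the DataKit host> to all collected data, but in some cases this default host tag causes trouble.

Take MySQL as an example: MySQL does not run on the DataKit machine, yet the host tag should be the real hostname of the collected MySQL instance (or some other identifier of the cloud database), not the hostname of the DataKit machine.

In this situation there are two ways to bypass DataKit's global tags:

Most collectors have a tags configuration like the one below, where tags can be added. If DataKit's default host=xxx tag is not wanted, override it here. Using MySQL as an example: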

[inputs.mysql.tags]
  host = "real-mysql-host-name"

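When pushing data to DataKit through the HTTP API, the API parameter ignore_global_tags can be used to suppress all global tags

Info

Since 1.4.20, DataKit by default uses the IP/Host in the connection address of the collected service as the value of the host tag.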

DataKit Runtime Log Configuration¶

DataKit writes two logs of its own: its runtime log (/var/log/datakit/log) and its HTTP access log (/var/log/datakit/gin.log).

The default log level is info. Edit datakit.conf to change the log level and the rotation size:

[logging]
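  level = "debug" # change info to debug
  rotate = 32     # each log segment is 32MB

level: when set to debug, more log output is visible (only the debug and info levels are currently supported).
rotate: DataKit rotates its log by default. The default segment size is 32MB with 6 segments in total (1 segment being written plus 5 rotated segments; the number of segments is not yet configurable). If the DataKit log takes up too much disk space (at most 32 x 6 = 192MB), reduce rotate (for example to 4; the unit is MB). The HTTP access log is rotated in the same way.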
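
The disk-usage arithmetic above generalizes to any rotate value. A minimal Go sketch, assuming the figures stated in the text (one segment being written plus five rotated ones); maxLogDiskMB is a hypothetical helper name:

```go
package main

import "fmt"

// maxLogDiskMB returns the worst-case disk usage of one rotated log file:
// the segment being written plus all kept backups, each up to rotateMB.
func maxLogDiskMB(rotateMB, backups int) int {
	return rotateMB * (backups + 1)
}

func main() {
	fmt.Println(maxLogDiskMB(32, 5)) // 192
	fmt.Println(maxLogDiskMB(4, 5))  // 24
}
```

Since the HTTP access log is rotated the same way, the figure applies to each of the two logs separately.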

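Advanced Configuration¶

The following covers some advanced settings. If you are unsure about a setting, consult our technical experts.

Time Calibration¶

Version-1.75.0

To keep local clock drift from affecting data collection, DataKit can call a DataWay API (Version-1.6.0) to detect whether its own clock has drifted significantly. When a significant deviation is detected, DataKit uses the calibrated current time as the data collection time; it does not modify the system time.

datakit.conf has the following settings: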

  # use dataway as NTP server
  [dataway.ntp]
    enable   = true  # default enabled
    interval = "5m"  # sync dataway time each 5min(minimal 1min)

    # if abs(datakit time - dataway time) >= diff, datakit will adjust data point
    # time with dataway time.
    diff     = "30s"  # minimal 5s

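Warning

This behavior is enabled by default. If the DataWay version is too old, the system time is still used (that is, no calibration takes place)
eBPF-related collection currently runs separately from DataKit and does not yet support time calibration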
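
The rule in the configuration comment, "if abs(datakit time - dataway time) >= diff, adjust", can be sketched as follows. This is a simplified illustration, not DataKit's code: times are whole Unix seconds, and clockOffset and pointTimestamp are hypothetical names.

```go
package main

import "fmt"

// clockOffset is the difference measured at sync time: DataWay time minus
// local time, in seconds. Positive means the local clock is behind.
func clockOffset(localSec, dataWaySec int64) int64 {
	return dataWaySec - localSec
}

// pointTimestamp returns the time to stamp on a data point. The offset is
// applied only when its absolute value reaches diffSec (the diff setting);
// the system clock itself is never changed.
func pointTimestamp(localSec, offsetSec, diffSec int64) int64 {
	abs := offsetSec
	if abs < 0 {
		abs = -abs
	}
	if abs >= diffSec {
		return localSec + offsetSec
	}
	return localSec
}

func main() {
	offset := clockOffset(1000, 1045) // local clock is 45s behind DataWay

	fmt.Println(pointTimestamp(1200, offset, 30)) // 1245: 45s >= 30s, calibrated
	fmt.Println(pointTimestamp(1200, 10, 30))     // 1200: 10s < 30s, left as is
}
```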

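IO Module Tuning¶

Version-1.4.8 · Experimental

In some cases a single DataKit instance collects a very large amount of data. If network bandwidth is limited, part of the collection may be interrupted or part of the data dropped. Tuning some io module parameters can mitigate this: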

[io]
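  feed_chan_size  = 1     # length of the data processing queue
  max_cache_count = 1000  # point-count threshold for batch sending; exceeding it in the cache triggers a send
  flush_interval  = "10s" # send interval threshold; data is sent at least once every 10s
  flush_workers   = 0     # number of upload workers (defaults to CPU cores * 2)

For blocking mode, see the corresponding description in the Kubernetes documentation.

For Kubernetes, see the corresponding settings in the Kubernetes deployment documentation.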

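Resource Limits¶

Because the volume of data DataKit processes cannot be predicted, DataKit may consume a large share of the node's resources unless its resource usage is physically limited. Linux cgroup and Windows job objects can be used to enforce limits. datakit.conf has the following settings: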

[resource_limit]
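  path = "/datakit" # Linux cgroup directory, such as /sys/fs/cgroup/memory/datakit, /sys/fs/cgroup/cpu/datakit

  # allowed number of CPU cores
  cpu_cores = 2.0

  cpu_max = 20.0 # deprecated

  # 4GB of memory (memory + swap) is allowed by default
  # a value of 0 or a negative number disables the memory limit
  mem_max_mb = 4096

If DataKit exceeds the memory limit, the operating system kills it. The following command shows the result; the service then has to be started manually: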

$ systemctl status datakit 
● datakit.service - Collects data and upload it to DataFlux.
     Loaded: loaded (/etc/systemd/system/datakit.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: signal) since Fri 2022-02-30 16:39:25 CST; 1min 40s ago
    Process: 3474282 ExecStart=/usr/local/datakit/datakit (code=killed, signal=KILL)
   Main PID: 3474282 (code=killed, signal=KILL)

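Note

Resource limits are enabled by default only for host installations
Only CPU usage and memory usage (mem+swap) can be limited, and only on Linux and Windows (Version-1.15.0).
CPU usage limits are currently not supported on these Windows versions: Windows 7, Windows Server 2008 R2, Windows Server 2008, Windows Vista, Windows Server 2003, and Windows XP.
When a non-root user changes the resource limit configuration, the service must be reinstalled.
The CPU core limit affects the worker count of some DataKit submodules (usually an integer multiple of the CPU core count). For example, the number of data upload workers is CPU cores * 2. Each upload worker uses 10MB of memory by default for sending data, so allowing more CPU cores increases DataKit's overall memory usage
cgroup v2 is supported since Version-1.5.8. If unsure of the cgroup version, check with the command mount | grep cgroup.
Version-1.68.0 supports configuring the CPU core limit in datakit.conf and deprecates the earlier percentage-based setting. With a percentage, hosts with different core counts get different CPU quotas, which may lead to abnormal behavior under the same collection load. When upgrading an older DataKit, specify the DK_LIMIT_CPUCORES environment variable in the upgrade command. If it is not specified, the earlier percentage-based setting remains in effect. A fresh DataKit installation uses the CPU core limit directly.
cpu_max: CPU usage is a percentage (maximum 100.0). On an 8-core CPU, for example, a cpu_max of 20.0 (that is, 20%) means DataKit's maximum CPU consumption shows up as about 160% in top.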
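
Two pieces of arithmetic from the notes above, written out as a minimal Go sketch. The function names are hypothetical, and the 10MB-per-worker figure is the default quoted in the note, not a measured value.

```go
package main

import "fmt"

// topPercent converts the deprecated cpu_max percentage into the figure that
// top reports, where each fully used core counts as 100%.
func topPercent(hostCores int, cpuMax float64) float64 {
	return float64(hostCores) * cpuMax
}

// uploadWorkerMemMB estimates the memory held by upload workers:
// cpu_cores * 2 workers, each using 10MB by default.
func uploadWorkerMemMB(cpuCores float64) float64 {
	const workersPerCore = 2
	const memPerWorkerMB = 10
	return cpuCores * workersPerCore * memPerWorkerMB
}

func main() {
	fmt.Println(topPercent(8, 20.0))    // 160
	fmt.Println(uploadWorkerMemMB(2.0)) // 40
}
```

The first function also shows why the percentage form was deprecated: the same cpu_max of 20.0 yields 160% on an 8-core host but only 80% on a 4-core host.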

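cgroup setup failure

On some hosts, DataKit detects cgroup v1 and reports an error when generating the corresponding cgroup rules:

cgroup setup err=...: open /sys/fs/cgroup/memory/datakit/memory.memsw.limit_in_bytes: permission denied

This error is not caused by insufficient user permissions; it occurs because Swap Accounting is not enabled in the kernel. To check whether Swap Accounting is enabled:

# check for swapaccount=1 or cgroup.memory=swapaccount=1
cat /proc/cmdline

If it is missing, edit GRUB_CMDLINE_LINUX or GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and append swapaccount=1, then run the following command and reboot the machine: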

sudo update-grub # Debian/Ubuntu
# or
sudo grub2-mkconfig -o /boot/grub2/grub.cfg # CentOS/RHEL/Fedora

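Election Configuration¶

See the election documentation.

DataWay Parameter Configuration¶

The following Dataway settings can be adjusted; changing the others is not recommended:

timeout: timeout for uploading to Guance, 30s by default (the sample configuration above uses the key timeout_v2)
max_retry_count: number of retries for Dataway sends (default 1, maximum 10) Version-1.17.0
retry_delay: base step of the retry interval, 1s by default. With a base step, the first retry waits 1s, the second 2s, the third 4s, and so on (doubling each time, that is, growing as 2^n) Version-1.17.0
max_raw_body_size: maximum size of a single upload body (before compression), in bytes Version-1.17.1
content_encoding: either v1 or v2 Version-1.17.1
- v1 is line protocol (the original default)
- v2 is the Protobuf protocol, which outperforms v1 in every respect; it has been the default since Version-1.32.0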
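
The doubling schedule described for retry_delay can be sketched in Go as below. It follows the "1s, 2s, 4s" description in the list above; retryDelay is a hypothetical name, and whether DataKit caps or jitters the delay is not covered here.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay returns the wait before the n-th retry (n starts at 1):
// base, 2*base, 4*base, and so on. A value of n below 1 is treated as 1.
func retryDelay(base time.Duration, n int) time.Duration {
	if n < 1 {
		n = 1
	}
	return base << uint(n-1)
}

func main() {
	for n := 1; n <= 3; n++ {
		fmt.Println(n, retryDelay(time.Second, n))
	}
	// Output:
	// 1 1s
	// 2 2s
	// 3 4s
}
```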

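For the related settings in a Kubernetes deployment, see the Kubernetes deployment documentation.

WAL Queue Configuration¶

Version-1.60.0

The WAL caches data that DataKit has not yet been able to upload. When a burst of collected data cannot be sent in time, DataKit writes it to a disk queue so that data collection is not blocked and real-time behavior is not affected.

The WAL disk queue has a default disk size limit. Once the cached data exceeds the limit, newly collected data can no longer be written and is dropped. To avoid dropping a given data category (typically logging, L), add it to the no_drop_categories list. Data is then no longer dropped, but data collection blocks instead.

The WAL queue settings can be adjusted under [dataway.wal]: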

  [dataway.wal]
     max_capacity_gb = 2.0             # 2GB reserved disk space for each category(M/L/O/T/...)
     workers = 0                       # flush workers on WAL(default to CPU limited cores)
     mem_cap = 0                       # in-memory queue capacity(default to CPU limited cores)
     fail_cache_clean_interval = "30s" # duration for clean fail uploaded data
     #no_drop_categories = ["L"]       # category list that do not drop data if WAL disk full

The disk files are located in the cache/dw-wal directory under the DataKit installation directory:

/usr/local/datakit/cache/dw-wal/
├── custom_object
│   └── data
├── dialtesting
│   └── data
├── dynamic_dw
│   └── data
├── fc
│   └── data
├── keyevent
│   └── data
├── logging
│   ├── data
│   └── data.00000000000000000000000000000000
├── metric
│   └── data
├── network
│   └── data
├── object
│   └── data
├── profiling
│   └── data
├── rum
│   └── data
├── security
│   └── data
└── tracing
    └── data

13 directories, 14 files

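Here, fc is the failed-upload retry queue, and each of the other directories corresponds to one data category. When an upload fails, the data is cached in the fc directory, and DataKit later uploads it again periodically.

If the disk performance of the current host is insufficient, try placing the WAL on tmpfs.

Setting a Default Pipeline Script Locally¶

A default Pipeline script can be set locally. If it conflicts with the default script set remotely, the local setting takes precedence.

There are two ways to configure it:

For host deployments, specify the default script for each category in the DataKit main configuration file, as follows: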

# default pipeline
[pipeline.default_pipeline]
    # logging = "<your_script.p>"
    # metric  = "<your_script.p>"
    # tracing = "<your_script.p>"

For container deployments, use the environment variable ENV_PIPELINE_DEFAULT_PIPELINE, with a value such as {"logging":"abc.p","metric":"xyz.p"}
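
The environment variable value is a JSON object that maps a category name to a script name. A minimal Go sketch of reading such a value follows; parseDefaultPipelines is a hypothetical helper, and how DataKit itself validates the value is not shown in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseDefaultPipelines decodes a category-to-script JSON object such as
// {"logging":"abc.p","metric":"xyz.p"}. (Hypothetical helper.)
func parseDefaultPipelines(raw string) (map[string]string, error) {
	scripts := map[string]string{}
	if err := json.Unmarshal([]byte(raw), &scripts); err != nil {
		return nil, fmt.Errorf("invalid default pipeline value: %w", err)
	}
	return scripts, nil
}

func main() {
	scripts, err := parseDefaultPipelines(`{"logging":"abc.p","metric":"xyz.p"}`)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(scripts["logging"], scripts["metric"]) // abc.p xyz.p
}
```

Checking the value this way before putting it into datakit.yaml catches quoting mistakes, which are easy to make when JSON is embedded in YAML.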

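Setting the Maximum Number of Open File Descriptors¶

On Linux, the ulimit entry in the DataKit main configuration file sets the maximum number of files DataKit may open, as follows: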

ulimit = 64000

ulimit defaults to 64000. In Kubernetes, set ENV_ULIMIT instead.

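Collector Password Protection¶

Version-1.31.0

Use this feature to avoid storing passwords in plain text in configuration files.

When DataKit loads collector configuration files at startup and encounters ENC[], it obtains the password from a file, from env, or by AES decryption, substitutes it into the text, and reloads the result into memory, so that the correct password is used.

ENC currently supports three methods:

File form (recommended):

Password format in the configuration file: ENC[file:///path/to/enc4dk]. Put the correct password in the referenced file.
AES encryption.

The key must be configured in the main configuration file datakit.conf: crypto_AES_key or crypto_AES_Key_filePath. The key length is 16 characters. The password is written as: ENC[aes://5w1UiRjWuVk53k96WfqEaGUYJ/Oje7zr8xmBeGa3ugI=]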
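
The two ENC[] forms shown above share one shape: a scheme, then ://, then a payload. The Go sketch below splits a value along those lines. It only illustrates the format; parseENC is a hypothetical name and this is not DataKit's parser.

```go
package main

import (
	"fmt"
	"strings"
)

// parseENC splits a value such as "ENC[file:///path/to/enc4dk]" into its
// scheme ("file") and payload ("/path/to/enc4dk"). The last result is false
// when the value is not in ENC[scheme://payload] form, in which case the
// value is an ordinary plain-text password.
func parseENC(value string) (string, string, bool) {
	if !strings.HasPrefix(value, "ENC[") || !strings.HasSuffix(value, "]") {
		return "", "", false
	}
	inner := strings.TrimSuffix(strings.TrimPrefix(value, "ENC["), "]")

	scheme, payload, found := strings.Cut(inner, "://")
	if !found || scheme == "" || payload == "" {
		return "", "", false
	}
	return scheme, payload, true
}

func main() {
	for _, v := range []string{
		"ENC[file:///usr/local/datakit/enc4mysql]",
		"ENC[aes://5w1UiRjWuVk53k96WfqEaGUYJ/Oje7zr8xmBeGa3ugI=]",
		"plain-password",
	} {
		scheme, payload, ok := parseENC(v)
		fmt.Println(ok, scheme, payload)
	}
}
```

For the file form the payload is the path of the password file; for the AES form it is the base64 ciphertext, which may itself contain "/" characters, so the split is made on the first "://" only.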

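The following uses mysql as an example to show how to configure both methods:

1 File form

First put the plain-text password into the file /usr/local/datakit/enc4mysql, then edit the configuration file mysql.conf:

# partial configuration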
[[inputs.mysql]]
  host = "localhost"
  user = "datakit"
  pass = "ENC[file:///usr/local/datakit/enc4mysql]"
  port = 3306
  # sock = "<SOCK>"
  # charset = "utf8"

DataKit reads the password from /usr/local/datakit/enc4mysql and substitutes it; after substitution the entry reads pass = "Hello*******"

2 AES encryption

First configure the key in datakit.conf:

# crypto key or key filePath.
[crypto]
  # configure the key
  aes_key = "0123456789abcdef"
  # or put the key in a file and configure the file location here
  aes_Key_file = "/usr/local/datakit/mykey"

mysql.conf configuration file:

pass = "ENC[aes://5w1UiRjWuVk53k96WfqEaGUYJ/Oje7zr8xmBeGa3ugI=]"

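Note that the ciphertext produced by AES encryption must be filled in completely. Code examples follow:

Golang

import (
    "bytes"
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "encoding/base64"
    "fmt"
    "io"
)

// AESEncrypt encrypts plaintext with AES-CBC and returns base64(IV + ciphertext).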
func AESEncrypt(key []byte, plaintext string) (string, error) {
    block, err := aes.NewCipher(key)
    if err != nil {
        return "", err
    }

    // PKCS7 padding
    padding := aes.BlockSize - len(plaintext)%aes.BlockSize
    padtext := bytes.Repeat([]byte{byte(padding)}, padding)
    plaintext += string(padtext)
    ciphertext := make([]byte, aes.BlockSize+len(plaintext))
    iv := ciphertext[:aes.BlockSize]
    if _, err := io.ReadFull(rand.Reader, iv); err != nil {
        return "", err
    }
    mode := cipher.NewCBCEncrypter(block, iv)
    mode.CryptBlocks(ciphertext[aes.BlockSize:], []byte(plaintext))

    return base64.StdEncoding.EncodeToString(ciphertext), nil
}

// AESDecrypt reverses AESEncrypt: it base64-decodes the input, then decrypts it with AES-CBC.
func AESDecrypt(key []byte, cryptoText string) (string, error) {
    ciphertext, err := base64.StdEncoding.DecodeString(cryptoText)
    if err != nil {
        return "", err
    }

    block, err := aes.NewCipher(key)
    if err != nil {
        return "", err
    }

    // Need the IV plus at least one block, and whole blocks only;
    // otherwise CryptBlocks below would panic.
    if len(ciphertext) < 2*aes.BlockSize || len(ciphertext)%aes.BlockSize != 0 {
        return "", fmt.Errorf("invalid ciphertext length")
    }

    iv := ciphertext[:aes.BlockSize]
    ciphertext = ciphertext[aes.BlockSize:]

    mode := cipher.NewCBCDecrypter(block, iv)
    mode.CryptBlocks(ciphertext, ciphertext)

    // Remove PKCS7 padding
    padding := int(ciphertext[len(ciphertext)-1])
    if padding == 0 || padding > aes.BlockSize {
        return "", fmt.Errorf("invalid padding")
    }
    ciphertext = ciphertext[:len(ciphertext)-padding]

    return string(ciphertext), nil
}

Java

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Base64;

public class AESUtils {
    public static String AESEncrypt(byte[] key, String plaintext) throws Exception {
        javax.crypto.Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        SecretKeySpec secretKeySpec = new SecretKeySpec(key, "AES");

        SecureRandom random = new SecureRandom();
        byte[] iv = new byte[16];
        random.nextBytes(iv);
        IvParameterSpec ivParameterSpec = new IvParameterSpec(iv);
        cipher.init(Cipher.ENCRYPT_MODE, secretKeySpec, ivParameterSpec);
        byte[] encrypted = cipher.doFinal(plaintext.getBytes());
        byte[] ivAndEncrypted = new byte[iv.length + encrypted.length];
        System.arraycopy(iv, 0, ivAndEncrypted, 0, iv.length);
        System.arraycopy(encrypted, 0, ivAndEncrypted, iv.length, encrypted.length);

        return Base64.getEncoder().encodeToString(ivAndEncrypted);
    }

    public static String AESDecrypt(byte[] key, String cryptoText) throws Exception {
        byte[] ciphertext = Base64.getDecoder().decode(cryptoText);

        SecretKeySpec secretKeySpec = new SecretKeySpec(key, "AES");

        if (ciphertext.length < 16) {
            throw new Exception("ciphertext too short");
        }

        byte[] iv = new byte[16];
        System.arraycopy(ciphertext, 0, iv, 0, 16);
        byte[] encrypted = new byte[ciphertext.length - 16];
        System.arraycopy(ciphertext, 16, encrypted, 0, ciphertext.length - 16);

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        IvParameterSpec ivParameterSpec = new IvParameterSpec(iv);
        cipher.init(Cipher.DECRYPT_MODE, secretKeySpec, ivParameterSpec);

        byte[] decrypted = cipher.doFinal(encrypted);

        return new String(decrypted);
    }

    public static void main(String[] args) {
        try {
            String key = "0123456789abcdef"; // 16, 24, or 32 bytes AES key
            String plaintext = "HelloAES9*&.";
            byte[] keyBytes = key.getBytes("UTF-8");

            String encrypted = AESEncrypt(keyBytes, plaintext);
            System.out.println("Encrypted text: " + encrypted);

            String decrypt = AESDecrypt(keyBytes, encrypted);
            System.out.println("Decrypted text: " + decrypt);
        } catch (Exception e) {
            System.out.println(e);
            e.printStackTrace();
        }
    }
}

In a K8S environment, the key can be supplied through environment variables: ENV_CRYPTO_AES_KEY and ENV_CRYPTO_AES_KEY_FILEPATH. For reference, see: DaemonSet installation - Others

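Remote Jobs¶

Version-1.63.0

DataKit receives jobs issued by the center and executes them. The JVM dump feature is currently supported.

This feature runs the jmap command to generate a dump file and uploads it to OSS, an AWS S3 Bucket, or HuaWei Cloud OBS.

After DataKit is installed, two files are generated under template/service-task in the installation directory: jvm_dump_host_script.py and jvm_dump_k8s_script.py. The former is the script for host mode; the latter is for k8s environments.

After DataKit starts, it runs the script periodically. Any changes made to the script are overwritten when DataKit restarts.

In a host environment, python3 and the required packages must be available. Install them if they are missing: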

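# with a python3 environment
pip install requests
# or
pip3 install requests

# to upload to HuaWei Cloud OBS, install this library:
pip install esdk-obs-python --trusted-host pypi.org

# to upload to AWS S3, install boto3:
pip install boto3

Environment variables select which type of storage bucket to upload to. The configuration is described below; the same applies to k8s environments: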

# upload to OSS
[remote_job]
  enable = true
  envs = [
      "REMOTE=oss",
      "OSS_BUCKET_HOST=host","OSS_ACCESS_KEY_ID=key","OSS_ACCESS_KEY_SECRET=secret","OSS_BUCKET_NAME=bucket",
    ]
  interval = "30s"

# or upload to AWS:
[remote_job]
  enable = true
  envs = [
      "REMOTE=aws",
      "AWS_BUCKET_NAME=bucket","AWS_ACCESS_KEY_ID=AK","AWS_SECRET_ACCESS_KEY=SK","AWS_DEFAULT_REGION=us-west-2",
    ]
  interval = "30s"

# or upload to OBS:
[remote_job]
  enable = true
  envs = [
      "REMOTE=obs",
      "OBS_BUCKET_NAME=bucket","OBS_ACCESS_KEY_ID=AK","OBS_SECRET_ACCESS_KEY=SK","OBS_SERVER=https://xxx.myhuaweicloud.com"
    ]
  interval = "30s"

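A K8S environment needs to call the Kubernetes API, so RBAC (role-based access control) is required.

Configuration:

Host deployment

The configuration file is usually located at:

Linux/Mac: /usr/local/datakit/conf.d/datakit.conf
Windows: C:\Program Files\datakit\conf.d\datakit.conf

Edit the configuration; if the section is missing, append it at the end: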

[remote_job]
  enable=true
  envs=["REMOTE=oss","OSS_BUCKET_HOST=<bucket_host>","OSS_ACCESS_KEY_ID=<key>","OSS_ACCESS_KEY_SECRET=<secret key>","OSS_BUCKET_NAME=<name>"]
  interval="100s"
  java_home=""

Kubernetes

Edit the DataKit yaml file and add the RBAC permissions

---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datakit
rules:
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["clusterroles"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["nodes", "nodes/stats", "nodes/metrics", "namespaces", "pods", "pods/log", "events", "services", "endpoints", "persistentvolumes", "persistentvolumeclaims", "pods/exec"]
  verbs: ["get", "list", "watch", "create"]
- apiGroups: ["apps"]
  resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
  resources: ["jobs", "cronjobs"]
  verbs: [ "get", "list", "watch"]
- apiGroups: ["guance.com"]
  resources: ["datakits"]
  verbs: ["get","list"]
- apiGroups: ["monitoring.coreos.com"]
  resources: ["podmonitors", "servicemonitors"]
  verbs: ["get", "list"]
- apiGroups: ["metrics.k8s.io"]
  resources: ["pods", "nodes"]
  verbs: ["get", "list"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]

---

In the configuration above, "pods/exec" has been added; keep everything else consistent with the existing yaml.

Add the remote_job environment variables:

- name: ENV_REMOTE_JOB_ENABLE
  value: 'true'
- name: ENV_REMOTE_JOB_ENVS
  value: >-
    REMOTE=oss,OSS_BUCKET_HOST=<bucket host>,OSS_ACCESS_KEY_ID=<key>,OSS_ACCESS_KEY_SECRET=<secret key>,OSS_BUCKET_NAME=<name>
- name: ENV_REMOTE_JOB_JAVA_HOME
- name: ENV_REMOTE_JOB_INTERVAL
  value: 100s

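Configuration reference:

| Configuration item | Environment variable | Description |
| --- | --- | --- |
| enable | ENV_REMOTE_JOB_ENABLE | Switch for the remote_job feature. |
| envs | ENV_REMOTE_JOB_ENVS | Contains the host, access key, secret key, and bucket information used to send the captured JVM dump file to OSS. AWS and OBS work the same way with their own environment variables. |
| interval | ENV_REMOTE_JOB_INTERVAL | Interval at which DataKit actively calls the API to fetch the latest jobs. |
| java_home | ENV_REMOTE_JOB_JAVA_HOME | In a host environment it is taken automatically from the environment variable ($JAVA_HOME) and need not be configured. |

Note: the Agent dd-java-agent.jar in use must not be older than v1.4.0-guance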

Point Cache¶

Version-1.28.0

The Point cache currently has additional performance problems and is not recommended.

To reduce DataKit's memory usage under high load, the Point Pool can be enabled:

# datakit.conf
[point_pool]
    enable = true
    reserved_capacity = 4096

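In addition, the DataKit configuration can enable the content_encoding = "v2" transfer encoding (v2 is the default since Version-1.32.0), which has lower memory and CPU overhead than v1.

Warning

Under low load (DataKit memory usage of around 100MB), enabling the point pool increases DataKit's own memory usage. High load generally means scenarios with memory usage of 2GB or more. Enabling it also improves DataKit's own CPU consumption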

Further Reading¶

Host installation: install DataKit on a server

Kubernetes installation: install DataKit as a DaemonSet
