
DataKit Master Configuration


The DataKit master configuration is used to configure the running behavior of the DataKit itself.

The configuration file is generally located at:

  • Linux/Mac: /usr/local/datakit/conf.d/datakit.conf
  • Windows: C:\Program Files\datakit\conf.d\datakit.conf

When DataKit is installed as a DaemonSet, it does not actually load this configuration file, even though the file exists in the corresponding directory. The configuration is instead generated by injecting environment variables. For each of the following configuration items, you can find the corresponding environment variable in the Kubernetes deployment documentation.

Datakit Main Configuration Sample

The Datakit main configuration file is datakit.conf. Here is a sample (1.28.1):

datakit.conf
################################################
# Global configures
################################################
# Default enabled input list.
default_enabled_inputs = [
  "cpu",
  "disk",
  "diskio",
  "host_processes",
  "hostobject",
  "mem",
  "net",
  "swap",
  "system",
]

# enable_pprof: bool
# If pprof enabled, we can profiling the running datakit
enable_pprof = true
pprof_listen = "localhost:6060" # pprof listen

# protect_mode: bool, default false
# When protect_mode is enabled, we can not set radical collection parameters; such parameters may cause Datakit
# to collect data too frequently.
protect_mode = true

# The user name running datakit. Generally for audit purpose. Default is root.
datakit_user = "root"

################################################
# ulimit: set max open-files limit(Linux only)
################################################
ulimit = 64000

################################################
# point_pool: use point pool for better memory usage(Experimental)
################################################
[point_pool]
  enable = false
  reserved_capacity = 4096

################################################
# DCA configure
################################################
[dca]
  # Enable or disable DCA
  enable = false

  # set DCA HTTP api server
  listen = "0.0.0.0:9531"

  # DCA client white list(raw IP or CIDR ip format)
  # Example: [ "1.2.3.4", "192.168.1.0/24" ]
  white_list = []

################################################
# Upgrader 
################################################
[dk_upgrader]
  # host address
  host = "0.0.0.0"

  # port number
  port = 9542 

################################################
# Pipeline
################################################
[pipeline]
  # IP database type, support iploc and geolite2
  ipdb_type = "iploc"

  # How often to sync remote pipeline
  remote_pull_interval = "1m"

  #
  # reftab configures
  #
  # Reftab remote HTTP URL(https/http)
  refer_table_url = ""

  # How often reftab syncs with the remote
  refer_table_pull_interval = "5m"

  # use sqlite to store reftab data to reduce memory usage
  use_sqlite = false
  # or use pure memory to cache the reftab data
  sqlite_mem_mode = false

  # Offload data processing tasks to post-level data processors.
  [pipeline.offload]
    receiver = "datakit-http"
    addresses = [
      # "http://<ip>:<port>"
    ]

################################################
# HTTP server(9529)
################################################
[http_api]

  # HTTP server address
  listen = "localhost:9529"

  # Disable 404 page to hide detailed Datakit info
  disable_404page = false

  # Only enable these APIs. If the list is empty, all APIs are enabled.
  public_apis = []

  # Datakit server-side timeout
  timeout = "30s"
  close_idle_connection = false

  #
  # RUM related: we should port these configures to RUM inputs(TODO)
  #
  # When serving RUM(/v1/write/rum), extract the IP address from this HTTP header
  rum_origin_ip_header = "X-Forwarded-For"
  # When serving RUM(/v1/write/rum), only accept requests from these app-ids.
  # If the list is empty, all apps' requests are accepted.
  rum_app_id_white_list = []

  # Only these domains are allowed for CORS. If the list is empty, all domains are allowed.
  allowed_cors_origins = []

  # Start Datakit web server with HTTPS
  [http_api.tls]
    # cert = "path/to/certificate/file"
    # privkey = "path/to/private_key/file"

################################################
# io configures
################################################
[io]

  # How often Datakit flushes data to Dataway.
  # Datakit will upload data points once the cached(in-memory) points
  #  reach(>=) max_cache_count or the flush_interval is triggered.
  max_cache_count = 1000
  flush_workers   = 0 # default to (cpu_core * 2 + 1)
  flush_interval  = "10s"

  # Disk cache for points that failed to upload
  enable_cache = false
  # Cache all categories of data points to disk
  cache_all = false
  # Max disk cache size(in GB); if the cache size reaches
  # the limit, old data is dropped(FIFO).
  cache_max_size_gb = 10
  # Cache clean interval: Datakit will try to clean these
  # failed data points at the specified interval.
  cache_clean_interval = "5s"

  # Data point filter configurations.
  # NOTE: Most of the time you should use the web-side filter; this is a debug helper for developers.
  #[io.filters]
  #  logging = [
  #   "{ source = 'datakit' or f1 IN [ 1, 2, 3] }"
  #  ]
  #  metric = [
  #    "{ measurement IN ['datakit', 'disk'] }",
  #    "{ measurement CONTAIN ['host.*', 'swap'] }",
  #  ]
  #  object = [
  #    "{ class CONTAIN ['host_.*'] }",
  #  ]
  #  tracing = [
  #    "{ service = re("abc.*") AND some_tag CONTAIN ['def_.*'] }",
  #  ]

[recorder]
  enabled = false
  #path = "/path/to/point-data/dir"
  encoding = "v2"  # use protobuf-json format
  duration = "30m" # record for 30 minutes

  # only record these inputs, if empty, record all
  inputs = [
    #"cpu",
    #"mem",
  ]

  # only record these categories, if empty, record all
  category = [
    #"logging",
    #"object",
  ]

################################################
# Dataway configure
################################################
[dataway]
  # urls: Dataway URL list
  # NOTE: do not configure multiple URLs here, it's a deprecated feature.
  urls = ["https://openway.guance.com?token=tkn_xxxxxxxxxxx"]

  # Dataway HTTP timeout
  timeout_v2 = "30s"

  # max_retry_count specifies at most how many times the data sending operation will be tried when it fails,
  # valid minimum value is 1 (NOT 0) and maximum value is 10.
  max_retry_count = 4

  # The interval between two retry operation, valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h"
  retry_delay = "1s"

  # HTTP Proxy(IP:Port)
  http_proxy = ""

  max_idle_conns   = 0       # limit idle TCP connections for HTTP request to Dataway
  enable_httptrace = false   # enable trace HTTP metrics(connection/DNS/TLS and so on)
  idle_timeout     = "90s"   # if not set, defaults to 90s

  # HTTP body content type, other candidates are(case insensitive):
  #  - v1: line-protocol
  #  - v2: protobuf
  content_encoding = "v1"

  # Enable GZip to upload point data.
  #
  # do NOT disable gzip, or you will get large network payloads.
  gzip = true

  max_raw_body_size = 10485760 # max body size(before gzip) in bytes

  # Customer tag or field keys that will be extracted from existing points
  # to build the X-Global-Tags HTTP header value.
  global_customer_keys = []
  enable_sinker        = false # disable sinker

################################################
# Datakit logging configure
################################################
[logging]

  # log path
  log = "/var/log/datakit/log"

  # HTTP access log
  gin_log = "/var/log/datakit/gin.log"

  # log level(info/debug)
  level = "info"

  # Disable log color
  disable_color = false

  # log rotate size(in MB)
  # DataKit will always keep at most n+1 (n backup logs and 1 currently written log) split log files on disk.
  rotate = 32

  # Maximum number of backup logs
  rotate_backups = 5

################################################
# Global tags
################################################
# We will try to add these tags to every collected data point if these
# tags do not exist in the original data.
#
# NOTE: to get the real IP and hostname of the current node, just set
# "$datakit_ip" or "__datakit_ip" here. Same for the hostname.
[global_host_tags]
  ip   = "$datakit_ip"
  host = "$datakit_hostname"

[election]
  # Enable election
  enable = false

  # Election namespace.
  # NOTE: for a single workspace, there can be multiple election namespaces.
  namespace = "default"

  # If enabled, every data point will add a tag with election_namespace = <your-election-namespace>
  enable_namespace_tag = false

  # Like global_host_tags, but only for data points that are remotely collected(such as MySQL/Nginx).
  [election.tags]
    #  project = "my-project"
    #  cluster = "my-cluster"

###################################################
# Tricky: we can rename the default hostname here
###################################################
[environments]
  ENV_HOSTNAME = ""

################################################
# resource limit configures
################################################
[resource_limit]

  # enable or disable resource limit
  enable = true

  # Linux only, cgroup path
  path = "/datakit"

  # set max CPU usage(%, max 100.0, no matter how many CPU cores here)
  cpu_max = 20.0

  # set max memory usage(MB)
  mem_max_mb = 4096

################################################
# git_repos configures
################################################

# We can host all input configurations on a git server
[git_repos]
  # git pull interval
  pull_interval = "1m"

  # git repository settings
  [[git_repos.repo]]
    # enable the repository or not
    enable = false

    # the branch name to pull
    branch = "master"

    # git repository URL. There are 3 formats here:
    #   - HTTP(s): such as "https://github.datakit.com/path/to/datakit-conf.git"
    #   - Git: such as "git@github.com:path/to/datakit.git"
    #   - SSH: such as "ssh://git@github.com:9000/path/to/repository.git"
    url = ""

    # For the Git and SSH formats, we need extra configurations:
    ssh_private_key_path = ""
    ssh_private_key_password = ""

Configuration of HTTP Service

DataKit opens an HTTP service to receive external data or provide basic data services to the outside world.

Modify the HTTP Service Address

The default HTTP service address is localhost:9529. If port 9529 is occupied, or if you want to access DataKit's HTTP service from outside (for example, to receive RUM or Tracing data), you can modify it to:

[http_api]
   listen = "0.0.0.0:<other-port>"
   # or using IPV6 address
   # listen = "[::]:<other-port>"

NOTE: IPv6 support requires Datakit version 1.5.7 or later.

Using Unix Domain Socket

Datakit supports access over UNIX domain sockets. To enable it, set the listen field to the full path of a file that does not yet exist. In the example below, datakit.sock can be any file name.

[http_api]
   listen = "/tmp/datakit.sock"
After the configuration is complete, you can use the curl command to test whether the configuration is successful: sudo curl --no-buffer -XGET --unix-socket /tmp/datakit.sock http://localhost/v1/ping. For more information on curl test commands, see here.

HTTP Request Frequency Control

As DataKit needs to receive a large number of external data writes, in order to avoid causing huge overhead to the host node, the following HTTP configuration can be modified (it is not turned on by default):

[http_api]
  request_rate_limit = 1000.0 # Limit each HTTP API to receive at most 1000 requests per second

Other Settings

[http_api]
    close_idle_connection = true # Close idle connections
    timeout = "30s"              # Set server-side HTTP timeout

See here.

Global Tag Modification

Version-1.4.6

DataKit allows you to configure global tags for all the data it collects. Global tags fall into two categories:

  • Host class global tags: the collected data is closely related to the current host, such as CPU/memory and other metric data.
  • Environment class global tags: the collected data comes from a public entity, such as MySQL/Redis. Such collectors generally run under election, so the host-related global tags are not carried on their data.
[global_host_tags]
  ip         = "__datakit_ip"
  host       = "__datakit_hostname"

[election]
  [election.tags]
    project = "my-project"
    cluster = "my-cluster"

When adding global tags, there are several things to pay attention to:

  • These global tag values can use several variables currently supported by DataKit (both the double underscore (__) prefix and $ are available):

    • __datakit_ip/$datakit_ip: the tag value is set to the IP of the first primary network card that DataKit obtains.
    • __datakit_hostname/$datakit_hostname: the tag value is set to the hostname of DataKit.
  • Do not put any metric fields into the global tags; due to DataKit's data transmission protocol restrictions, data processing would otherwise fail with a protocol violation. See the field list of the specific collectors for details. Also, don't add too many tags, and there are limits on the length of each tag's key and value.

  • If the collected data already has a tag with the same name, DataKit will not append the global tag configured here.
  • Even if global_host_tags does not configure any global tags, DataKit will still try to add a host=$HOSTNAME tag to all data.
  • We can set the same tags in both kinds of global tags, for example, set project = "my-project" in both global_host_tags and [election.tags], as shown in the sketch below.
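
For example, a minimal sketch (using only the tag names already shown in this section) that sets the same project tag in both places:

[global_host_tags]
  ip      = "__datakit_ip"
  host    = "__datakit_hostname"
  project = "my-project" # same key also set under election tags below

[election]
  [election.tags]
    project = "my-project"
    cluster = "my-cluster"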

Settings of Global Tag in Remote Collection

By default, DataKit appends the tag host=<hostname where DataKit runs> to all collected data, but this default host tag causes trouble in some cases.

Take MySQL as an example: if MySQL is not on the same machine as DataKit, you may want this host tag to be the real hostname of the MySQL being collected (or another identifying field of the cloud database), rather than the hostname of DataKit.

In this case, we can bypass the global tag on DataKit in two ways:

  • In the specific collector, there is generally a tags table where we can add tags. For example, if we don't want DataKit to add the default host=xxx tag, we can override it there. Taking MySQL as an example:
[inputs.mysql.tags]
  host = "real-mysql-host-name"
Tip

Since 1.4.20, DataKit uses identifying fields of the collected service (such as its IP or host name) as the host tag by default, so this problem is improved after upgrading. It is recommended that you upgrade to this version to avoid the problem.

DataKit Own Running Log Configuration

DataKit has two logs of its own: its run log (/var/log/datakit/log) and the HTTP access log (/var/log/datakit/gin.log).

The default logging level for DataKit is info. Edit datakit.conf to modify the log level and slice size:

[logging]
  level = "debug" # change info to debug
  rotate = 32     # each log slice is 32MB
  • level: When set to debug, you can see more logs (currently only the debug/info levels are supported).
  • rotate: DataKit splits its log by default. The default slice size is 32MB, and there are 6 slices in total (1 currently written slice plus 5 backup slices; the slice count is not yet configurable). If you dislike DataKit logs taking up too much disk space (at most 32 x 6 = 192MB), reduce the rotate size (for example, change it to 4, i.e. 4MB). HTTP access logs are rotated in the same way.

Advanced Configuration

The following content involves some advanced configuration. If you are not sure about the configuration, it is recommended to consult our technical experts.

Point Pool

Version-1.28.0 · Experimental

To optimize Datakit's memory usage under high load conditions, we can enable Point Pool to alleviate the pressure:

# datakit.conf
[point_pool]
    enable = true
    reserved_capacity = 4096

We can also set content_encoding = "v2" in the Dataway configuration; the v2 encoding has lower memory and CPU overhead than v1.
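
For example, a minimal sketch combining the two settings in datakit.conf (values as in the sample above):

[point_pool]
    enable = true            # enable the experimental point pool
    reserved_capacity = 4096

[dataway]
    content_encoding = "v2"  # Protobuf encoding: lower memory/CPU overhead than v1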

Attention

When Datakit is under low load (with a small memory footprint), enabling the point pool will use more memory (extra memory is needed to cache unused points), but not excessively. "High load" typically refers to scenarios where memory consumption reaches 2GB or more. Enabling the point pool not only helps memory usage but also improves Datakit's CPU consumption.

IO Module Parameter Adjustment

Version-1.4.8 · Experimental

In some cases, the amount of data collected by DataKit is very large. If the network bandwidth is limited, some data collection may be interrupted or discarded. You can mitigate this problem by configuring some parameters of the io module:

[io]
  feed_chan_size  = 4096  # length of data processing queue (a job typically has multiple points)
  max_cache_count = 512   # points are uploaded in batches; once the cached points exceed this count, a flush is triggered
  flush_interval  = "10s" # data is flushed at least once every 10s
  flush_workers   = 8     # upload workers, default CPU-core * 2 + 1

See corresponding description in k8s for blocking mode

See here

IO Disk Cache

Version-1.5.8 · Experimental

When DataKit fails to send data, the disk cache can be turned on so that critical data is not lost. The purpose of the disk cache is to temporarily store data on disk when uploading to Dataway fails, and to fetch it from disk and upload it again later.

[io]
  enable_cache      = true   # turn on disk caching
  cache_all         = false  # cache all categories (by default, metric, object and dial-testing data points are not cached)
  cache_max_size_gb = 5 # specify a disk size of 5GB

See here


Attention

cache_max_size_gb controls the maximum disk capacity of each data category. Since there are 10 categories, if each one is configured with 5GB, the total disk usage may reach 50GB.

Resource Limit

Because the amount of data processed by DataKit cannot be estimated in advance, if the resources consumed by DataKit are not physically limited, it may consume a large amount of the resources of the node where it runs. Here we can limit it with the help of cgroups on Linux or job objects on Windows, using the following configuration in datakit.conf:

[resource_limit]
  path = "/datakit" # Linux cgroup restricts directories, such as /sys/fs/cgroup/memory/datakit, /sys/fs/cgroup/cpu/datakit

  # Maximum CPU utilization allowed (percentile)
  cpu_max = 20.0

  # Allows 4GB of memory (memory + swap) by default
  # If set to 0 or negative, memory limits are not enabled
  mem_max_mb = 4096 

If DataKit exceeds the memory limit, it will be forcibly killed by the operating system. This can be seen in the output of the command below, and the service then needs to be started manually.

$ systemctl status datakit 
 datakit.service - Collects data and upload it to DataFlux.
     Loaded: loaded (/etc/systemd/system/datakit.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: signal) since Fri 2022-02-30 16:39:25 CST; 1min 40s ago
    Process: 3474282 ExecStart=/usr/local/datakit/datakit (code=killed, signal=KILL)
   Main PID: 3474282 (code=killed, signal=KILL)
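
If this happens, you can restart the service manually. For example, on a systemd-managed Linux host (as in the output above); the second command assumes your Datakit version ships the service sub-command:

$ sudo systemctl restart datakit
$ # or, via the Datakit command-line tool
$ sudo datakit service -R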
Attention
  • Resource limits are only turned on by default for host installation.
  • Resource limits only support CPU usage and memory usage (mem + swap) controls, and only support Linux and Windows (Version-1.15.0) operating systems.
  • CPU usage control is not supported on these Windows systems: Windows 7, Windows Server 2008 R2, Windows Server 2008, Windows Vista, Windows Server 2003 and Windows XP.
  • When adjusting the resource limit as a non-root user, the service must be reinstalled.
Tip

Datakit supports cgroup V2 from version 1.5.8. If you are unsure of the cgroup version, you can check it with the command mount | grep cgroup.

Election Configuration

See here

Dataway Settings

Dataway has the following settings to configure (a combined example follows the list):

  • timeout: The timeout for request data to Dataway. The default value is 30s
  • max_retry_count: Sets the number of retries to request Dataway (4 by default) Version-1.17.0
  • retry_delay : Sets the base step of the retry interval. The default value is 200ms; the base step means 200ms for the first retry, 400ms for the second, 800ms for the third, and so on (in increments of $2^n$) Version-1.17.0
  • max_raw_body_size: Set the maximum size of a single uploaded package (before compression), in bytes Version-1.17.1
  • content_encoding : v1 or v2 can be selected Version-1.17.1
    • v1 is line-protocol (default: v1)
    • v2 is the Protobuf protocol. Compared with v1, it has better performance in all aspects
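
Putting these together, here is a sketch of the corresponding [dataway] section in datakit.conf; the values are illustrative only (key names match the sample configuration above):

[dataway]
  urls              = ["https://openway.guance.com?token=tkn_xxxxxxxxxxx"]
  timeout_v2        = "30s"      # request timeout
  max_retry_count   = 4          # retry at most 4 times on failure
  retry_delay       = "200ms"    # base retry step (200ms, 400ms, 800ms, ...)
  max_raw_body_size = 10485760   # max upload body size before compression, in bytes
  content_encoding  = "v2"       # or "v1" (line protocol, the default)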

See here for configuration under Kubernetes.

Dataway Sinker

See here

Managing DataKit Configuration with Git

Because the configuration of the various collectors in DataKit is plain text, it takes a lot of effort to modify them and make each one take effect. Here we can use Git to manage these configurations, which has the following advantages:

  • Automatically synchronize the latest configuration from the remote Git repository and take effect automatically.
  • Git has its own version management, which can effectively track the change history of various configurations.

When you install DataKit (both DaemonSet installation and host installation support this), you can specify the Git configuration repository.

Manually Configure Git Administration

Datakit supports using Git to manage collector configurations, Pipelines, and Python scripts. In datakit.conf, find the git_repos section and edit it as follows:

[git_repos]
  pull_interval = "1m" # Synchronize configuration interval, that is, synchronize once every 1 minute

  [[git_repos.repo]]
    enable = false   # Do not enable the repo

    ###########################################
    # Three protocols supported by Git address: http/git/ssh
    ###########################################
    url = "http://username:password@github.com/path/to/repository.git"

    # The following two protocols (git/ssh) need to be configured with key-path and key-password
    # url = "git@github.com:path/to/repository.git"
    # url = "ssh://git@github.com:9000/path/to/repository.git"
    # ssh_private_key_path = "/Users/username/.ssh/id_rsa"
    # ssh_private_key_password = "<YOUR-PASSWORD>"

    branch = "master" # Specify git branch

Note: After Git synchronization is turned on, the collector configurations in the original conf.d directory no longer take effect (except datakit.conf).

Applying Git-managed Pipeline Sample

We can add a Pipeline to the collector configuration to cut the logs of related services. When Git synchronization is turned on, both the Pipelines that ship with DataKit and the Pipelines synchronized via Git can be used. Here is a Pipeline configuration example for the Nginx collector:

[[inputs.nginx]]
    ...
    [inputs.nginx.log]
    ...
    pipeline = "my-nginx.p" # Where to load my-nginx.p, see the "constraint" description below

Git-managed Usage Constraints

The following constraints must be followed when using Git:

  • Create a new conf.d folder in git repo, and put the DataKit collector configuration below
  • Create a new pipeline folder in git repo, and place the Pipeline file below
  • Create a new python.d folder in git repo, and place the Python script file below

The directory layout is illustrated below:

datakit root directory
├── conf.d
├── data
├── pipeline # top-level Pipeline scripts
├── python.d # top-level python.d scripts
├── externals
└── gitrepos
    ├── repo-1   # repository 1
    │   ├── conf.d    # dedicated to storing collector configurations
    │   ├── pipeline  # dedicated to storing Pipeline cutting scripts
    │   │   ├── my-nginx.p # valid Pipeline script
    │   │   └── 123        # invalid Pipeline subdirectory; files under it will not take effect
    │   │       └── some-invalid.p
    │   └── python.d  # dedicated to storing python.d scripts
    │       └── core
    └── repo-2   # repository 2
        ├── ...

The lookup priority is defined as follows:

  1. Find the specified file name in the order of git_repos configured in datakit.conf (it is an array, so multiple Git repositories can be configured), and the first match wins. For example, when looking for my-nginx.p, if it is found under pipeline in the first repository, it prevails; even if a my-nginx.p with the same name exists in the second repository, it will not be selected.

  2. If it is not found in git_repos, look for the Pipeline script in the <Datakit Installation Directory>/pipeline directory, or the Python script in the <Datakit Installation Directory>/python.d directory.

Set the Maximum Value of Open File Descriptor

In a Linux environment, you can configure the ulimit entry in the Datakit main configuration file to set the maximum number of open files for Datakit, as follows:

ulimit = 64000

ulimit is configured to 64000 by default.
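
To verify the limit actually applied to a running Datakit process on Linux, one option is to read its limits from procfs; a minimal sketch, assuming a single datakit process is running:

$ cat /proc/"$(pgrep -x datakit | head -n1)"/limits | grep -i "open files"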

CPU Utilization Rate Description for Resource Limit

CPU utilization is expressed as a percentage (maximum 100.0). For an 8-core CPU, if the limit cpu_max is 20.0 (that is, 20%), the maximum CPU consumption of DataKit will be displayed as about 160% in the top command.

Extended Readings
