
Various Other Tool Usages


DataKit ships with many small built-in tools for daily use. You can view DataKit's command-line help with the following command:

datakit help

Note: The exact help content may vary between platforms.

If you want to see how a specific command is used (such as dql), you can use the following command:

$ datakit help dql
usage: datakit dql [options]

DQL used to query data. If no option specified, query interactively. Other available options:

      --auto-json      pretty output string if field/tag value is JSON
      --csv string     Specify the directory
  -F, --force          overwrite csv if file exists
  -H, --host string    specify datakit host to query
  -J, --json           output in json format
      --log string     log path (default "/dev/null")
  -R, --run string     run single DQL
  -T, --token string   run query for specific token(workspace)
  -V, --verbose        verbosity mode
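
For example, the -R/--run option listed above runs a single query non-interactively; the DQL statement below is only an illustration, so replace it with your own query:

# run one DQL query and exit
datakit dql --run 'M::cpu LIMIT 1'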

Debugging Commands

Debugging the Blacklist

Version-1.14.0

To debug whether a piece of data will be filtered by the centrally configured blacklist, you can use the following command:

Linux/macOS:

$ datakit debug --filter=/usr/local/datakit/data/.pull --data=/path/to/lineproto.data

Dropped

    ddtrace,http_url=/webproxy/api/online_status,service=web_front f1=1i 1691755988000000000

By 7th rule(cost 1.017708ms) from category "tracing":

    { service = 'web_front' and ( http_url in [ '/webproxy/api/online_status' ] )}

Windows:

PS > datakit.exe debug --filter 'C:\Program Files\datakit\data\.pull' --data '\path\to\lineproto.data'

Dropped

    ddtrace,http_url=/webproxy/api/online_status,service=web_front f1=1i 1691755988000000000

By 7th rule(cost 1.017708ms) from category "tracing":

    { service = 'web_front' and ( http_url in [ '/webproxy/api/online_status' ] )}

The output above shows that the data in lineproto.data matches the 7th rule (counting from 1) of the tracing category in the .pull file; once matched, the point is discarded.
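
For reference, the file passed via --data contains line protocol points, one per line. A minimal hypothetical lineproto.data that would match the rule above could contain just the point shown in the output:

    ddtrace,http_url=/webproxy/api/online_status,service=web_front f1=1i 1691755988000000000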

Obtaining File Paths Using glob Rules

Version-1.8.0

In log collection, log paths can be configured using glob rules.

You can debug glob rules with DataKit by providing a configuration file in which each line is one glob pattern.

An example of the configuration file is as follows:

$ cat glob-config
/tmp/log-test/*.log
/tmp/log-test/**/*.log

A complete command example is as follows:

$ datakit debug --glob-conf glob-config
============= glob paths ============
/tmp/log-test/*.log
/tmp/log-test/**/*.log

========== found the files ==========
/tmp/log-test/1.log
/tmp/log-test/logfwd.log
/tmp/log-test/123/1.log
/tmp/log-test/123/2.log
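
For context, these are the same glob patterns you would place in a logging collector configuration; a hypothetical snippet (assuming the logging input's logfiles field):

[[inputs.logging]]
  ## glob patterns being debugged above
  logfiles = [
    "/tmp/log-test/*.log",
    "/tmp/log-test/**/*.log",
  ]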

Matching Text with Regular Expressions

Version-1.8.0

In log collection, multiline log collection can be achieved by configuring regular expressions.

You can debug regular expression rules with DataKit by providing a configuration file whose first line is the regular expression and whose remaining lines are the text to be matched (which may span multiple lines).

An example of the configuration file is as follows:

$ cat regex-config
^\d{4}-\d{2}-\d{2}
2020-10-23 06:41:56,688 INFO demo.py 1.0
2020-10-23 06:54:20,164 ERROR /usr/local/lib/python3.6/dist-packages/flask/app.py Exception on /0 [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
ZeroDivisionError: division by zero
2020-10-23 06:41:56,688 INFO demo.py 5.0

A complete command example is as follows:

$ datakit debug --regex-conf regex-config
============= regex rule ============
^\d{4}-\d{2}-\d{2}

========== matching results ==========
  Ok:  2020-10-23 06:41:56,688 INFO demo.py 1.0
  Ok:  2020-10-23 06:54:20,164 ERROR /usr/local/lib/python3.6/dist-packages/flask/app.py Exception on /0 [GET]
Fail:  Traceback (most recent call last):
Fail:    File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
Fail:      response = self.full_dispatch_request()
Fail:  ZeroDivisionError: division by zero
  Ok:  2020-10-23 06:41:56,688 INFO demo.py 5.0
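
For context, the regular expression debugged here is typically used as the multiline rule of a logging collector; a hypothetical snippet (assuming the logging input's multiline_match field):

[[inputs.logging]]
  logfiles = ["/path/to/app.log"]
  ## lines that do not start with a date are appended to the previous log entry
  multiline_match = '''^\d{4}-\d{2}-\d{2}'''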

Viewing the Running Status of DataKit

For the usage of monitor, please refer to here.

Checking the Correctness of Collector Configuration

After editing a collector configuration file, it may contain errors (such as an invalid configuration file format). You can check whether it is correct with the following command:

datakit check --config
------------------------
checked 13 conf, all passing, cost 22.27455ms

Viewing Workspace Information

To make it easy to view workspace information on the server side, DataKit provides the following command:

datakit tool --workspace-info
{
  "token": {
    "ws_uuid": "wksp_2dc431d6693711eb8ff97aeee04b54af",
    "bill_state": "normal",
    "ver_type": "pay",
    "token": "tkn_2dc438b6693711eb8ff97aeee04b54af",
    "db_uuid": "ifdb_c0fss9qc8kg4gj9bjjag",
    "status": 0,
    "creator": "",
    "expire_at": -1,
    "create_at": 0,
    "update_at": 0,
    "delete_at": 0
  },
  "data_usage": {
    "data_metric": 96966,
    "data_logging": 3253,
    "data_tracing": 2868,
    "data_rum": 0,
    "is_over_usage": false
  }
}

Debugging KV Files

When a collector configuration file is written with KV templates, you can debug the substitution with the following command (an illustrative template fragment follows the output below):

datakit tool --parse-kv-file conf.d/host/cpu.conf --kv-file data/.kv

[[inputs.cpu]]
  ## Collect interval, default is 10 seconds. (optional)
  interval = '10s'

  ## Collect CPU usage per core, default is false. (optional)
  percpu = false

  ## Setting disable_temperature_collect to false will collect cpu temperature stats for linux. (deprecated)
  # disable_temperature_collect = false

  ## Enable to collect core temperature data.
  enable_temperature = true

  ## Enable gets average load information every five seconds.
  enable_load5s = true

[inputs.cpu.tags]
  kv = "cpu_kv_value3"

Viewing Cloud Attribute Data

If the machine where DataKit is installed is a cloud server (currently aliyun/tencent/aws/hwcloud/azure are supported), you can view some cloud attribute data with the following command. For example (a value of - means the field is not available):

datakit tool --show-cloud-info aws

           cloud_provider: aws
              description: -
     instance_charge_type: -
              instance_id: i-09b37dc1xxxxxxxxx
            instance_name: -
    instance_network_type: -
          instance_status: -
            instance_type: t2.nano
               private_ip: 172.31.22.123
                   region: cn-northwest-1
        security_group_id: launch-wizard-1
                  zone_id: cnnw1-az2

Parsing Line Protocol Data

Version-1.5.6

You can parse line protocol data through the following command:

datakit tool --parse-lp /path/to/file
Parse 201 points OK, with 2 measurements and 201 time series

The result can also be output in JSON format:

datakit tool --parse-lp /path/to/file --json
{
  "measurements": {  # List of metric sets
    "testing": {
      "points": 7,
      "time_series": 6
    },
    "testing_module": {
      "points": 195,
      "time_series": 195
    }
  },
  "point": 202,        # Total number of points
  "time_serial": 201   # Total number of timelines
}
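
For reference, the input file is InfluxDB-style line protocol, one point per line: a measurement name, optional tags, one or more fields, and a nanosecond timestamp. A minimal hypothetical example file:

    testing,host=host-1 f1=0.5,f2=2i 1698217783322857000
    testing,host=host-2 f1=0.7,f2=3i 1698217783322857000
    testing_module,host=host-1,module=a latency=0.25 1698217793321744000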

Data Recording and Replay

Version-1.19.0

Data replay is mainly used to import previously recorded data; during demos or tests, no additional collection is needed.

Enabling Data Recording

Data recording can be enabled in datakit.conf. Once enabled, DataKit records data to the specified directory for later import:

[recorder]
  enabled  = true
  path     = "/path/to/recorder"     # Absolute path; defaults to the <DataKit installation directory>/recorder directory
  encoding = "v2"                    # v2 is protobuf-JSON (xxx.pbjson); v1 (xxx.lp) is line protocol. The former is more readable and supports more data types
  duration = "10m"                   # Recording duration, counted from DataKit startup
  inputs   = ["cpu", "mem"]          # Record only the listed collectors (names as shown in the "Inputs Info" panel of monitor); empty means all collectors
  categories = ["logging", "metric"] # Categories to record; empty means all data types

After the recording starts, the directory structure is roughly as follows (showing the pbjson format of time-series data here):

[ 416] /usr/local/datakit/recorder/
├── [  64]  custom_object
├── [  64]  dynamic_dw
├── [  64]  keyevent
├── [  64]  logging
├── [  64]  network
├── [  64]  object
├── [  64]  profiling
├── [  64]  rum
├── [  64]  security
├── [  64]  tracing
└── [1.9K]  metric
    ├── [1.2K]  cpu.1698217783322857000.pbjson
    ├── [1.2K]  cpu.1698217793321744000.pbjson
    ├── [1.2K]  cpu.1698217803322683000.pbjson
    ├── [1.2K]  cpu.1698217813322834000.pbjson
    └── [1.2K]  cpu.1698218363360258000.pbjson

12 directories, 59 files
Warning
  • After data recording is complete, remember to turn this feature off (enabled = false). Otherwise, recording starts every time DataKit starts and may consume a large amount of disk space.
  • The collector name is not necessarily identical to the name in the collector configuration ([[inputs.some-name]]); it is the name shown in the first column of the Inputs Info panel of monitor. Some collector names look like logging/<some-pod-name>; in that case the recorded file path is /usr/local/datakit/recorder/logging/logging-some-pod-name.1705636073033197000.pbjson, with the / in the collector name replaced by - to avoid creating an extra directory level.

Data Replay

After DataKit has recorded the data, you can archive this directory with Git or other tools (be sure to keep the existing directory structure), and then replay the data to Guance with the following command:

$ datakit import -P /usr/local/datakit/recorder -D https://openway.guance.com?token=tkn_xxxxxxxxx

> Uploading "/usr/local/datakit/recorder/metric/cpu.1698217783322857000.pbjson"(1 points) on metric...
+1h53m6.137855s ~ 2023-10-25 15:09:43.321559 +0800 CST
> Uploading "/usr/local/datakit/recorder/metric/cpu.1698217793321744000.pbjson"(1 points) on metric...
+1h52m56.137881s ~ 2023-10-25 15:09:53.321533 +0800 CST
> Uploading "/usr/local/datakit/recorder/metric/cpu.1698217803322683000.pbjson"(1 points) on metric...
+1h52m46.137991s ~ 2023-10-25 15:10:03.321423 +0800 CST
...
Total upload 75 kB bytes ok

Although the recorded data carries absolute timestamps (in nanoseconds), DataKit automatically shifts the data toward the current time during replay (preserving the relative intervals between points), so it looks like newly collected data.

You can obtain more help information about data import through the following command:

$ datakit help import

usage: datakit import [options]

Import used to play recorded history data to Guance. Available options:

  -D, --dataway strings   dataway list
      --log string        log path (default "/dev/null")
  -P, --path string       point data path (default "/usr/local/datakit/recorder")
Warning

For RUM data, replay fails if the target workspace has no matching APP ID. You can either create a new application in the target workspace whose APP ID matches the one in the recorded data, or replace the APP ID in the recorded data with that of an existing RUM application in the target workspace.

Others

Telegraf Integration

Note: Before using Telegraf, check whether DataKit can already collect the data you need. If it can, using Telegraf as well is not recommended, as it may cause data conflicts and operational problems.

Install the Telegraf integration

datakit install --telegraf

Start Telegraf

cd /etc/telegraf
cp telegraf.conf.sample telegraf.conf
telegraf --config telegraf.conf

For usage matters of Telegraf, refer to here.

Security Checker Integration

Install the Security Checker

datakit install --scheck

After a successful installation, it runs automatically. For details on using the Security Checker, refer to here.

eBPF Integration

Install the DataKit eBPF collector. Currently only the linux/amd64 and linux/arm64 platforms are supported. For usage instructions, see DataKit eBPF Collector.

datakit install --ebpf

If the error open /usr/local/datakit/externals/datakit-ebpf: text file busy appears, stop the DataKit service first and then run the command again.

Warning

This command was removed in Version-1.5.6; in newer versions the eBPF integration is built in by default.

Update IP Database

Host installation:

  • You can install or update the IP geolocation database directly with the following command (to use the alternative geolite2 database instead, simply replace iploc with geolite2):
datakit install --ipdb iploc
  • After updating the IP geographic information database, modify the datakit.conf configuration:
[pipeline]
  ipdb_type = "iploc"
  • Restart DataKit for the change to take effect

  • Test whether the IP library takes effect

datakit tool --ipinfo 1.2.3.4
        ip: 1.2.3.4
      city: Brisbane
  province: Queensland
   country: AU
       isp: unknown

If the installation fails, the output is as follows:

datakit tool --ipinfo 1.2.3.4
       isp: unknown
        ip: 1.2.3.4
      city: 
  province: 
   country: 
Kubernetes (yaml):

  • Modify datakit.yaml and uncomment the content between the 4 places marked with ---iploc-start and ---iploc-end.

  • Reinstall DataKit:

kubectl apply -f datakit.yaml

# Ensure the DataKit container is started
kubectl get pod -n datakit
  • Enter the container and test whether the IP library takes effect
datakit tool --ipinfo 1.2.3.4
        ip: 1.2.3.4
      city: Brisbane
  province: Queensland
   country: AU
       isp: unknown

If the installation fails, the output is as follows:

datakit tool --ipinfo 1.2.3.4
       isp: unknown
        ip: 1.2.3.4
      city: 
  province:
   country:
Kubernetes (Helm):

  • Add --set iploc.enable=true when deploying with Helm:
helm install datakit datakit/datakit -n datakit \
    --set datakit.dataway_url="https://openway.guance.com?token=<YOUR-TOKEN>" \
    --set iploc.enable=true \
    --create-namespace 

For deployment matters of Helm, refer to here.

  • Enter the container and test whether the IP library takes effect
datakit tool --ipinfo 1.2.3.4
        ip: 1.2.3.4
      city: Brisbane
  province: Queensland
   country: AU
       isp: unknown

If the installation fails, the output is as follows:

datakit tool --ipinfo 1.2.3.4
       isp: unknown
        ip: 1.2.3.4
      city: 
  province:
   country:

Automatic Command Completion

DataKit 1.2.12 adds command completion. It has only been tested on two Linux distributions, Ubuntu and CentOS, and is not supported on Windows or macOS.

Because the DataKit command line has a large number of parameters, a prompt and completion feature has been added to make it easier to use.

Most mainstream Linux systems support command completion. Taking Ubuntu and CentOS as examples, if you want to use the command completion function, you can additionally install the following software packages:

  • Ubuntu: apt install bash-completion
  • CentOS: yum install bash-completion bash-completion-extras

If these packages are already present before DataKit is installed, command completion is set up automatically during the DataKit installation. If they are installed (or updated) after DataKit, you can run the following command to install the DataKit command completion:

datakit tool --setup-completer-script

Completion usage example:

$ datakit <tab> # Press Tab to list the available commands
dql       help      install   monitor   pipeline  run       service   tool

$ datakit dql <tab> # Press Tab to list the available options
--auto-json   --csv         -F,--force    --host        -J,--json     --log         -R,--run      -T,--token    -V,--verbose

All DataKit commands mentioned in this document can be completed in this way.

Obtaining the Automatic Completion Script

If your Linux distribution is not Ubuntu or CentOS, you can export the completion script with the following command and then install it according to your platform's shell completion mechanism (a usage sketch follows the command below).

# Export the completion script to the local datakit-completer.sh file
datakit tool --completer-script > datakit-completer.sh
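
For example, on a bash-based system you could load the exported script for the current session or install it permanently (the installation path below is illustrative and varies by distribution):

# load completion for the current shell session only
source datakit-completer.sh

# or install it system-wide so every new shell picks it up
sudo cp datakit-completer.sh /etc/bash_completion.d/datakit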
