Host Installation¶
This article describes the basic installation of DataKit.
Register/log in to Guance Cloud¶
Open the Guance Cloud registration portal in a browser, fill in the required information, and then log in to Guance Cloud.
Get the Installation Command¶
Log in to the workspace, click "Integration" on the left, then select "Datakit" at the top to see the installation commands for each platform.
Note that the Linux/Mac/Windows installers below automatically identify the hardware platform (arm/x86, 32bit/64bit), so no hardware platform selection is required.
The installation command supports bash and ash (Version-1.14.0), and is roughly as follows:
bash
DK_DATAWAY=https://openway.guance.com?token=<TOKEN> bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"
ash
DK_DATAWAY=https://openway.guance.com?token=<TOKEN> ash -c "$(curl -L https://static.guance.com/datakit/install.sh)"
When the installation completes, the terminal shows a message confirming that it succeeded.
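Before piping the installer into a shell, it can help to confirm that the DataWay address actually carries a token. The function below is a minimal sketch: check_dataway is a hypothetical helper, not part of the install script, and it only checks the URL shape used in the commands above without contacting the DataWay.

```shell
# check_dataway URL
# Succeeds when URL looks like http(s)://<host>?token=<non-empty value>.
# Hypothetical helper for illustration; it performs no network access.
check_dataway() {
  case "$1" in
    http://?*\?token=?* | https://?*\?token=?*) return 0 ;;
    *) return 1 ;;
  esac
}

# tkn_example is a made-up token value.
if check_dataway "https://openway.guance.com?token=tkn_example"; then
  echo "DataWay address looks well-formed"
fi
```

A check like this catches the common mistake of leaving the token empty before anything is downloaded.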
On Windows, the installation must be run from a PowerShell command line started as an administrator. Press the Windows key, type powershell, then right-click the PowerShell icon that appears and select "Run as an administrator".
Remove-Item -ErrorAction SilentlyContinue Env:DK_*;
$env:DK_DATAWAY="https://openway.guance.com?token=<TOKEN>";
Set-ExecutionPolicy Bypass -scope Process -Force;
Import-Module bitstransfer;
start-bitstransfer -source https://static.guance.com/datakit/install.ps1 -destination .install.ps1;
powershell ./.install.ps1;
Install DataKit lite¶
You can specify the environment variable DK_LITE to install DataKit lite (Version-1.14.0):
Remove-Item -ErrorAction SilentlyContinue Env:DK_*;
$env:DK_DATAWAY="https://openway.guance.com?token=<TOKEN>";
$env:DK_LITE="1";
Set-ExecutionPolicy Bypass -scope Process -Force;
Import-Module bitstransfer;
start-bitstransfer -source https://static.guance.com/datakit/install.ps1 -destination .install.ps1;
powershell ./.install.ps1;
DataKit lite contains only the following collectors:
Collector Name | Description |
---|---|
cpu | Collect the CPU usage of the host |
disk | Collect disk occupancy |
diskio | Collect the disk IO status of the host |
mem | Collect the memory usage of the host |
swap | Collect Swap memory usage |
system | Collect the load of host operating system |
net | Collect host network traffic |
host_processes | Collect the list of resident (surviving for more than 10min) processes on the host |
hostobject | Collect basic information of host computer (such as operating system information, hardware information, etc.) |
DataKit(dk) | Collect Datakit running metrics |
RUM(rum) | Collect user access monitoring data |
Net dialtesting(dialtesting) | Collect the data generated by dialing test |
Prom(prom) | Collect data exposed by Prometheus Exporters |
logging | Collect file log data |
Install DataKit eBPF Span Linker Version¶
You can specify the environment variable DK_ELINKER to install DataKit ELinker (Version-1.30.0):
Remove-Item -ErrorAction SilentlyContinue Env:DK_*;
$env:DK_DATAWAY="https://openway.guance.com?token=<TOKEN>";
$env:DK_ELINKER="1";
Set-ExecutionPolicy Bypass -scope Process -Force;
Import-Module bitstransfer;
start-bitstransfer -source https://static.guance.com/datakit/install.ps1 -destination .install.ps1;
powershell ./.install.ps1;
DataKit ELinker contains only the following collectors:
Collector Name | Description |
---|---|
cpu | Collect the CPU usage of the host |
disk | Collect disk occupancy |
diskio | Collect the disk IO status of the host |
ebpftrace | Receive eBPF trace span and link these spans to generate trace id |
mem | Collect the memory usage of the host |
swap | Collect Swap memory usage |
system | Collect the load of host operating system |
net | Collect host network traffic |
hostobject | Collect basic information of host computer (such as operating system information, hardware information, etc.) |
DataKit(dk) | Collect Datakit running metrics |
Install Specific Version¶
To install a specific DataKit version, for example 1.2.3:
DK_DATAWAY=https://openway.guance.com?token=<TOKEN> bash -c "$(curl -L https://static.guance.com/datakit/install-1.2.3.sh)"
The same applies on Windows:
Remove-Item -ErrorAction SilentlyContinue Env:DK_*;
$env:DK_DATAWAY="https://openway.guance.com?token=<TOKEN>";
Set-ExecutionPolicy Bypass -scope Process -Force;
Import-Module bitstransfer;
start-bitstransfer -source https://static.guance.com/datakit/install-1.2.3.ps1 -destination .install.ps1;
powershell ./.install.ps1;
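The versioned commands above differ from the default ones only in the installer file name. As a small sketch of that naming pattern, the hypothetical helper installer_url (not part of DataKit) builds the script URL from an optional version and extension:

```shell
# installer_url [VERSION] [EXT]
# Prints the installer script URL following the pattern used above.
# VERSION empty -> latest installer; EXT defaults to "sh" (use "ps1" on Windows).
installer_url() {
  base="https://static.guance.com/datakit"
  version=$1
  ext=${2:-sh}
  if [ -n "$version" ]; then
    echo "$base/install-$version.$ext"
  else
    echo "$base/install.$ext"
  fi
}

installer_url 1.2.3     # prints https://static.guance.com/datakit/install-1.2.3.sh
installer_url "" ps1    # prints https://static.guance.com/datakit/install.ps1
```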
Additional Supported Environment Variable¶
If you need to define some DataKit configuration during the installation phase, add environment variables to the installation command alongside DK_DATAWAY.
For example, to add the DK_NAMESPACE setting:
Remove-Item -ErrorAction SilentlyContinue Env:DK_*;
$env:DK_DATAWAY="https://openway.guance.com?token=<TOKEN>";
$env:DK_NAMESPACE="<namespace>";
Set-ExecutionPolicy Bypass -scope Process -Force;
Import-Module bitstransfer;
start-bitstransfer -source https://static.guance.com/datakit/install.ps1 -destination .install.ps1;
powershell ./.install.ps1;
The format for setting environment variables on each of the two platform families is:
# Windows: Multiple environment variables are divided by semicolons
$env:NAME1="value1"; $env:Name2="value2"
# Linux/Mac: Multiple environment variables are divided by spaces
NAME1="value1" NAME2="value2"
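On Linux/Mac, variables written in front of a command are passed to that command only and do not persist in the calling shell. The following sketch demonstrates this with a child sh process standing in for the installer; NAME1 and NAME2 are the placeholder names from the format above.

```shell
# Variables placed before a command are visible to that command only.
NAME1="value1" NAME2="value2" sh -c 'echo "child sees: $NAME1 $NAME2"'

# The calling shell is left untouched, so nothing needs cleaning up afterwards.
echo "parent sees: ${NAME1:-<unset>}"
```

This is why the Linux/Mac commands need no counterpart to the Remove-Item step used in the PowerShell commands.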
The environment variables supported by the installation script are as follows (supported on all platforms).
Attention
- These environment variable settings are not supported for full offline installation. However, they can be set when installing through a proxy or from a local installation address.
- These environment variables are only effective in installation mode; they do not take effect in upgrade mode.
Most Commonly Used Environment Variables¶
- DK_DATAWAY: Specify the DataWay address. The DataKit installation command already includes it by default.
- DK_GLOBAL_TAGS: Deprecated; use DK_GLOBAL_HOST_TAGS instead.
- DK_GLOBAL_HOST_TAGS: Set global host tags during installation. Format example: host=__datakit_hostname,host_ip=__datakit_ip (multiple tags are separated by English commas).
- DK_GLOBAL_ELECTION_TAGS: Set global election tags during installation. Format example: project=my-project,cluster=my-cluster (multiple tags are separated by English commas).
- DK_DEF_INPUTS: List of collector names enabled by default. Format example: cpu,mem,disk. Some default inputs can also be disabled by prefixing the input name with -, such as -cpu,-mem,-disk. If the two forms are mixed, such as cpu,mem,-disk,-system, only the disabled list is honored: disk and system are disabled, and all other default inputs stay enabled.
- DK_CLOUD_PROVIDER: Specify the cloud vendor during installation (currently supported: aliyun/aws/tencent/hwcloud/azure). Deprecated: Datakit can infer the cloud type automatically.
- DK_USER_NAME: User name that the Datakit service runs as. Default is root. See the Attention note below for details.
- DK_LITE: Set this variable to 1 to install the simplified DataKit lite. (Version-1.14.0)
Disable all default inputs Version-1.5.5
We can set DK_DEF_INPUTS to - to disable all default inputs:
DK_DEF_INPUTS="-" \
DK_DATAWAY=https://openway.guance.com?token=<TOKEN> \
bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"
Besides, if Datakit has been installed before, all default input .conf files must be deleted manually. During installation, Datakit can add new input configurations but cannot delete existing ones.
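The DK_DEF_INPUTS rules described above (an explicit list, a list of disabled names, a mix of both, and a bare -) can be summarized in a short sketch. This is an illustration of the documented behavior only, not the installer's actual code: resolve_inputs is a hypothetical function, and the default collector list passed to it is an assumption made for the example.

```shell
# resolve_inputs DEFAULTS SPEC
# DEFAULTS: comma-separated default collectors (assumed list, for illustration).
# SPEC:     the value given to DK_DEF_INPUTS.
# Prints the collectors that end up enabled, comma-separated.
# The body runs in a subshell so the IFS change does not leak to the caller.
resolve_inputs() (
  defaults=$1
  spec=$2
  banned=""
  allowed=""
  result=""
  set -f          # no filename expansion while splitting the lists
  IFS=,
  for item in $spec; do
    case "$item" in
      -?*) banned="$banned,${item#-}" ;;   # "-name" disables one collector
      -)   ;;                              # a bare "-" is handled below
      ?*)  allowed="$allowed,$item" ;;     # "name" enables one collector
    esac
  done
  if [ "$spec" = "-" ]; then
    result=""                              # "-" disables every default input
  elif [ -n "$banned" ]; then
    # Mixed or disable-only list: only the disabled names are honored.
    for name in $defaults; do
      case "$banned," in
        *",$name,"*) ;;
        *) result="$result,$name" ;;
      esac
    done
  elif [ -n "$allowed" ]; then
    result=$allowed                        # explicit list of collectors
  else
    result=",$defaults"                    # nothing specified: keep defaults
  fi
  echo "${result#,}"
)

resolve_inputs "cpu,mem,disk,system,net" "cpu,mem,-disk,-system"   # prints cpu,mem,net
```

With the mixed value, the cpu and mem entries are ignored and only the two disabled names take effect, which matches the description of DK_DEF_INPUTS above.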
Attention
For permission reasons, running Datakit under a non-root DK_USER_NAME may make the following collectors unavailable:
In addition, note the following:
-
Create the user and group manually before starting the installation. The commands differ between Linux distributions; the commands below are for reference:
On DataKit's Own Log¶
- DK_LOG_LEVEL: Optional info/debug
- DK_LOG: If changed to stdout, the log will not be written to the file, but will be output by the terminal.
- DK_GIN_LOG: If changed to stdout, the log will not be written to the file, but will be output by the terminal.
On DataKit pprof¶
- DK_ENABLE_PPROF (deprecated): Whether to turn on pprof
- DK_PPROF_LISTEN: pprof service listening address
Version-1.9.2 enabled pprof by default.
On DataKit Election¶
- DK_ENABLE_ELECTION: Enable election; disabled by default. To enable it, give the environment variable any non-empty string value (e.g. True/False).
- DK_NAMESPACE: Specify the namespace during installation (used for election)
On HTTP/API Environment¶
- DK_HTTP_LISTEN: Specify the network interface that the DataKit HTTP service binds to during installation (default localhost)
- DK_HTTP_PORT: Specify the port that the DataKit HTTP service binds to during installation (default 9529)
- DK_RUM_ORIGIN_IP_HEADER: RUM-specific
- DK_DISABLE_404PAGE: Disable the DataKit 404 page (commonly used when deploying DataKit RUM on the public network; e.g. True/False)
- DK_INSTALL_IPDB: Specify the IP library at installation time (currently only iploc and geolite2 are supported)
- DK_UPGRADE_IP_WHITELIST: Starting from Datakit 1.5.9, Datakit can be upgraded through a remote HTTP API. This variable sets the IP whitelist of clients allowed remote access (multiple IPs are separated by commas). Access from outside the whitelist is denied (not restricted by default).
- DK_UPGRADE_LISTEN: Specify the DK-Upgrader HTTP server address (default 0.0.0.0:9542) Version-1.38.1
- DK_HTTP_PUBLIC_APIS: Specify which Datakit HTTP APIs can be accessed remotely; generally configured together with the RUM input. Supported from Datakit 1.9.2.
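The DK_UPGRADE_IP_WHITELIST semantics described above (a comma-separated list, with an empty value meaning unrestricted) can be sketched as follows. This is not Datakit's implementation: ip_allowed is a hypothetical function that performs exact string matching only, and whether Datakit accepts other forms such as CIDR ranges for this variable is not stated here.

```shell
# ip_allowed CLIENT_IP WHITELIST
# WHITELIST is a comma-separated list in the DK_UPGRADE_IP_WHITELIST style.
# An empty whitelist means access is not restricted.
ip_allowed() {
  [ -z "$2" ] && return 0
  case ",$2," in
    *",$1,"*) return 0 ;;   # exact match against one list entry
    *) return 1 ;;
  esac
}

# The addresses below are made-up examples.
if ip_allowed "10.0.0.5" "10.0.0.5,10.0.0.6"; then
  echo "10.0.0.5 may call the upgrade API"
fi
```

Wrapping both the list and the candidate in commas avoids partial matches such as 10.0.0.5 matching 10.0.0.50.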
On DCA¶
- DK_DCA_ENABLE: Turn on the DCA service during installation (not turned on by default)
- DK_DCA_LISTEN: Customize the DCA service listening address and port during installation (default 0.0.0.0:9531)
- DK_DCA_WHITE_LIST: Set the DCA service access whitelist; multiple entries are separated by commas (e.g. 192.168.0.1/24,10.10.0.1/24)
On External Collector¶
- DK_INSTALL_EXTERNALS: Used to install external collectors not packaged with DataKit
On Confd Configuration¶
Environment Variable Name | Type | Applicable Scenario | Description | Sample Value |
---|---|---|---|---|
DK_CONFD_BACKEND | string | All | Backend Source Type | etcdv3, zookeeper, redis or consul |
DK_CONFD_BASIC_AUTH | string | etcdv3, consul | Optional | |
DK_CONFD_CLIENT_CA_KEYS | string | etcdv3, consul | Optional | |
DK_CONFD_CLIENT_CERT | string | etcdv3, consul | Optional | |
DK_CONFD_CLIENT_KEY | string | etcdv3, consul or redis | Optional | |
DK_CONFD_BACKEND_NODES | string | All | Backend Source Address | [IP address 1:2379,IP address 2:2379] |
DK_CONFD_PASSWORD | string | etcdv3, consul | Optional | |
DK_CONFD_SCHEME | string | etcdv3, consul | Optional | |
DK_CONFD_SEPARATOR | string | redis | Optional default 0 | |
DK_CONFD_USERNAME | string | etcdv3, consul | Optional | |
On Git Configuration¶
- DK_GIT_URL: The remote git repo address for managing configuration files. (e.g. http://username:password@github.com/username/repository.git)
- DK_GIT_KEY_PATH: The full path of the local PrivateKey. (e.g. /Users/username/.ssh/id_rsa)
- DK_GIT_KEY_PW: The password for the local PrivateKey. (e.g. passwd)
- DK_GIT_BRANCH: Specify the branch to pull. If empty, the remote's default branch is used, which is usually master.
- DK_GIT_INTERVAL: The interval of the timed pull. (e.g. 1m)
On Sinker Configuration¶
DK_SINKER_GLOBAL_CUSTOMER_KEYS is used to set up the sinker tag/field keys. For example:
Remove-Item -ErrorAction SilentlyContinue Env:DK_*;
$env:DK_DATAWAY="https://openway.guance.com?token=<TOKEN>";
$env:DK_DATAWAY_ENABLE_SINKER="on";
$env:DK_SINKER_GLOBAL_CUSTOMER_KEYS="key1,key2";
Set-ExecutionPolicy Bypass -scope Process -Force;
Import-Module bitstransfer;
start-bitstransfer -source https://static.guance.com/datakit/install.ps1 -destination .install.ps1;
powershell ./.install.ps1;
On Resource Limit Configuration¶
Only the Linux and Windows (Version-1.15.0) operating systems are supported.
- DK_LIMIT_DISABLED: Turn off the resource limit function (on by default)
- DK_LIMIT_CPUMAX: Maximum CPU usage, default 30.0
- DK_LIMIT_MEMMAX: Memory limit (including swap), default 4096 (4GB)
APM Instrumentation¶
By specifying DK_APM_INSTRUMENTATION_ENABLED=host in the installation command, you can automatically inject APM for Java/Python applications:
DK_APM_INSTRUMENTATION_ENABLED=host \
DK_DATAWAY=https://openway.guance.com?token=<TOKEN> \
bash -c "$(curl -L https://static.guance.com/datakit/install.sh)"
After Datakit is installed, reopen a shell and restart the corresponding Java/Python applications.
To enable or disable this feature, modify the value of the instrumentation_enabled configuration under [apm_inject] in the datakit.conf file:
- Value "host": enable
- Value "" or "disable": disable
Operating environment requirements:
- Linux system
- CPU architecture: x86_64 or arm64
- C standard library: glibc 2.4 and above, or musl
- Java 8 and above
- Python 3.7 and above
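The requirements above can be checked before enabling injection. The following is a rough preflight sketch, not something the installer provides: arch_supported and version_ge are hypothetical helpers, aarch64 and amd64 are included as common alternative names for the two listed architectures, and version_ge compares only the MAJOR.MINOR part of a version string.

```shell
# arch_supported ARCH
# Succeeds for the two supported CPU architectures.
arch_supported() {
  case "$1" in
    x86_64 | amd64 | arm64 | aarch64) return 0 ;;
    *) return 1 ;;
  esac
}

# version_ge HAVE WANT
# Succeeds when HAVE >= WANT, comparing MAJOR and MINOR numerically
# (so 3.10 is correctly treated as newer than 3.7).
version_ge() {
  have_major=${1%%.*}
  want_major=${2%%.*}
  case "$1" in
    *.*) have_minor=${1#*.}; have_minor=${have_minor%%.*} ;;
    *)   have_minor=0 ;;
  esac
  case "$2" in
    *.*) want_minor=${2#*.}; want_minor=${want_minor%%.*} ;;
    *)   want_minor=0 ;;
  esac
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
  else
    [ "$have_minor" -ge "$want_minor" ]
  fi
}

if [ "$(uname -s)" = "Linux" ] && arch_supported "$(uname -m)"; then
  echo "platform meets the OS and CPU requirements"
fi
```

Note that Java 8 runtimes commonly report their version as 1.8.x, so a Java check built on version_ge would need to map that form to 8 first.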
In Kubernetes, you can inject APM through the Datakit Operator.
Other Installation Options¶
Environment Variable Name | Sample | Description |
---|---|---|
DK_INSTALL_ONLY | on | Install only, do not run |
DK_HOSTNAME | some-host-name | Set a custom hostname during installation |
DK_UPGRADE | 1 | Upgrade to the latest version |
DK_UPGRADE_MANAGER | on | Whether to also upgrade the Remote Upgrade Service when upgrading Datakit. Used in conjunction with DK_UPGRADE; supported from 1.5.9 |
DK_INSTALLER_BASE_URL | https://your-url | Choose the installation script for a different environment; defaults to https://static.guance.com/datakit |
DK_PROXY_TYPE | - | Proxy type. The options are: datakit or nginx, both lowercase |
DK_NGINX_IP | - | Proxy server IP address (fill in the IP only, not the port). It has the highest priority, is mutually exclusive with "HTTP_PROXY" and "HTTPS_PROXY", and overrides both. |
DK_INSTALL_LOG | - | Set the installation log path; defaults to install.log in the current directory. If set to stdout, the log is output to the command line terminal. |
HTTPS_PROXY | IP:Port | Install through the Datakit proxy |
DK_INSTALL_RUM_SYMBOL_TOOLS | on | Install source map tools for RUM; supported from Datakit 1.9.2. |
DK_VERBOSE | on | Enable more verbose output during installation (Linux/Mac only) Version-1.19.0 |
DK_CRYPTO_AES_KEY | 0123456789abcdfg | The key used to decrypt encrypted passwords, so that plaintext passwords need not appear in collector configuration. Version-1.31.0 |
DK_CRYPTO_AES_KEY_FILE | /usr/local/datakit/enc4dk | Another way to configure the secret key, which takes priority over the previous one. Put the key into a file and configure the file path through this environment variable. |
FAQ¶
How to Deal with the Unfriendly Host Name¶
Because DataKit uses the hostname as the basis for associating data, unfriendly host names such as iZbp141ahn.... can be inconvenient, especially when they cannot be modified for other reasons. In DataKit, such a host name can be overridden in the main configuration.
In datakit.conf, modify the following configuration and DataKit will read ENV_HOSTNAME to override the current real hostname:
Note: If a host has collected data for a period of time, after changing the host name, the historical data will no longer be associated with the new host name. Changing the host name is equivalent to adding a brand-new host.
Issue on macOS installation¶
If the following error appears during installation or upgrade on macOS:
"launchctl" failed with stderr: /Library/LaunchDaemons/cn.dataflux.datakit.plist: Service is disabled
# or
"launchctl" failed with stderr: /Library/LaunchDaemons/com.guance.datakit.plist: Service is disabled
Execute:
Then execute the following command:
sudo launchctl load -w /Library/LaunchDaemons/cn.dataflux.datakit.plist
# or
sudo launchctl load -w /Library/LaunchDaemons/com.guance.datakit.plist
Are there any high-risk operations on files and data in Datakit?¶
During its operation, Datakit reads a significant amount of system information based on the collection configuration, such as process lists and hardware and software information (e.g., OS information, CPU, memory, disk, network card, etc.). However, it does not proactively delete or modify data outside of itself. File reads and writes fall into two parts: file and port operations related to data collection, and the file reads and writes that Datakit needs for its own runtime.
Host files read/write during data collecting:
-
During process information collection and hardware and software information collection, Linux systems read the relevant information from the /proc directory; Windows systems mainly use WMI and the Golang Windows SDK to obtain this information.
-
If log collection is configured, Datakit will scan and read logs that match the configuration (e.g., syslog, user application logs, etc.).
-
Port usage: Datakit may open some ports to receive external data for interfacing with other systems. These ports are opened as needed based on the collector.
-
eBPF collection: Due to its particularity, eBPF requires more binary information of the Linux kernel and processes, resulting in the following actions:
- Analyze the binary files of all (or specified) running programs (dynamic libraries, processes within containers) for symbols and addresses.
- Read and write files under the kernel DebugFS mount point or interact with the PMU (Performance Monitoring Unit) to place kprobe/uprobe/tracepoint eBPF probes.
- uprobe probes will modify the CPU instructions of user processes to read relevant data.
In addition to collection, Datakit performs the following file reading and writing operations:
- Its own log files
On Linux, these are located in the /var/log/datakit/ directory; on Windows, they are located in the C:\Program Files\datakit directory.
Log files will automatically rotate when they reach a specified size (default 32MB), with a maximum number of rotations (default maximum of 5 + 1 segments).
- Disk cache
Some data collection requires the use of disk cache functionality (which must be manually enabled). This cache will involve file creation and deletion during the generation and consumption process. Disk cache also has a maximum capacity setting; when full, it will automatically perform FIFO deletion operations to prevent disk overflow.
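As a worked example of the log rotation defaults mentioned above, the worst-case disk footprint of Datakit's own logs is the segment size multiplied by the number of segments. The sketch below assumes that "5 + 1 segments" means five rotated files plus the file currently being written; log_max_mb is a hypothetical helper.

```shell
# log_max_mb [SIZE_MB] [BACKUPS]
# Worst-case log footprint in MB: each of the (BACKUPS + 1) segments
# can grow to SIZE_MB before rotation.
log_max_mb() {
  size_mb=${1:-32}
  backups=${2:-5}
  echo $(( size_mb * (backups + 1) ))
}

log_max_mb    # prints 192
```

Under that assumption, the default settings keep Datakit's own logs at or below roughly 192MB.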
How does Datakit control its own resource consumption?¶
Datakit's resource usage can be limited through mechanisms such as cgroup; the related installation variables are listed under "On Resource Limit Configuration" above. Deployments in Kubernetes are configured separately.
What is Datakit's own observability?¶
During its operation, Datakit exposes many internal metrics. By default, Datakit collects these metrics using the built-in collector and reports them to the user's workspace.
In addition, Datakit also comes with a monitor command-line tool that allows users to view the current operational status as well as the collection and reporting status.