
Managing DataKit Configuration with Git


This article explains how to use Git to manage DataKit configurations, including collector configurations and Pipeline scripts. By maintaining a local or remote Git repository, you can manage DataKit's configuration centrally and use Git's version control features to track historical changes.

Operating Mechanism

DataKit has a built-in Git client that periodically (every 1 minute by default) pulls the latest configuration from the Git repository and loads it, keeping the running configuration up to date.
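The pull interval can usually be tuned in datakit.conf. The following is a minimal sketch assuming a pull_interval field under the [git_repos] section; the exact field name may vary between DataKit versions:

[git_repos]
    # How often DataKit pulls from the configured repositories (assumed field; 1 minute by default)
    pull_interval = "1m"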

Usage Example

The complete steps for the usage example are as follows:

  1. Create a Git repository
  2. Plan the repository's configuration according to established directory rules
  3. Push the configuration to the Git repository
  4. Add the Git repository to the main DataKit configuration
  5. Restart DataKit
Note

The Git repository does not have to be created in this order. For example, you can create the remote repository first and then clone it locally to make changes. The following example creates a local Git repository first, then pushes it to a remote repository.

Create a Git Repository

First, create a local Git repository:

mkdir datakit-repo && cd datakit-repo
git init

Directory Planning

Create various basic directories:

mkdir -p conf.d   && touch conf.d/.gitkeep
mkdir -p pipeline && touch pipeline/.gitkeep
mkdir -p python.d && touch python.d/.gitkeep
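For illustration, a collector configuration and a matching Pipeline script could then be placed into the planned directories. The file names and source path below are examples only, assuming a default Linux installation under /usr/local/datakit:

# Copy a collector configuration into conf.d/ (any file name ending in .conf works)
cp /usr/local/datakit/conf.d/nginx/nginx.conf.sample conf.d/nginx.conf

# Add a Pipeline script for the same data source
touch pipeline/nginx.p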

Push Configuration

Use common Git commands to push configuration changes to the repository:

# cd your/path/to/repo

# Add your conf or Pipeline files under conf.d/pipeline/python.d first...

git add conf.d pipeline python.d
git commit -m "init datakit repo"

# Push the repo to your remote repository (ssh or https)
git remote add origin ssh://git@github.com/PATH/TO/datakit-confs.git
git push origin --all

Configure the Repository in DataKit

In the datakit.conf configuration file, locate the git_repos section and enable it, as shown below:

[[git_repos.repo]]
    enable = true # Enable the repo

    ###########################################
    # Git supports http/git/ssh authentication
    ###########################################
    url = "http://username:password@github.com/PATH/TO/datakit-confs.git"

    branch = "master" # Specify which branch to pull

    # git/ssh authentication requires configuring the key path and key password
    # url = "git@github.com:PATH/TO/datakit-confs.git"
    # url = "ssh://git@github.com/PATH/TO/datakit-confs.git"
    # ssh_private_key_path = "/Users/username/.ssh/id_rsa"
    # ssh_private_key_password = "<YOUR-PASSWORD>"

If the password contains special characters, refer to here.
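For example, characters such as @ or : in the password must be percent-encoded in the URL, since they have special meaning there. The password below is a placeholder for illustration:

# Password "p@ss:word" percent-encoded as "p%40ss%3Aword"
url = "https://username:p%40ss%3Aword@github.com/PATH/TO/datakit-confs.git"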

Restart DataKit

After the configuration is complete, restart DataKit. After a short wait, you can check the status of the collectors through the DataKit Monitor.
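On a Linux host this typically looks like the following; the exact commands may differ slightly by platform and DataKit version:

# Restart the DataKit service
datakit service -R        # or: systemctl restart datakit

# Open the monitor to verify collectors are loading from the Git repository
datakit monitor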

Git Usage in Kubernetes

Refer to here.

FAQ

Error: authentication required

This error can occur in several situations.

With SSH, it is generally because the provided key is incorrect. With HTTP, it may be due to:

  1. An incorrect username or password was provided
  2. The protocol in the Git address is incorrect

For example, the original address is

https://username:password@github.com/path/to/repository.git 

And it was written as

http://username:password@github.com/path/to/repository.git 

That is, https was mistakenly written as http, which will also trigger this error; changing http back to https resolves it.
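A quick way to verify the URL and credentials outside of DataKit is to query the repository directly with Git:

# If this also fails with an authentication error, the problem is in the URL or
# credentials themselves, not in DataKit
git ls-remote https://username:password@github.com/path/to/repository.git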

Repository Directory Constraints

The Git repository must use the following directory structure to store the various configurations:

├── conf.d    # Dedicated to storing collector configurations
├── pipeline  # Dedicated to storing Pipeline scripts
└── python.d  # Dedicated to storing Python scripts

Among them:

  • conf.d is dedicated to storing collector configurations; its layout can be planned arbitrarily (subdirectories are allowed), and any collector configuration file just needs to end with .conf
  • pipeline is used to store Pipeline scripts; it is recommended to organize them by data type
  • python.d is used to store Python scripts

Here is an example of DataKit's directory structure after Git synchronization is enabled:

DataKit root directory
├── conf.d   # Default main configuration directory
├── pipeline # Top-level Pipeline scripts
├── python.d # Top-level Python scripts
└── gitrepos
    ├── repo-1        # Repository 1
    │   ├── conf.d    # Dedicated to storing collector configurations
    │   ├── pipeline  # Dedicated to storing Pipeline scripts
    │   └── python.d  # Dedicated to storing Python scripts
    └── repo-2        # Repository 2
        └── ...

Git Configuration Loading Mechanism

After Git synchronization is enabled, the configuration (.conf/pipeline) priority is defined as follows:

  1. All collector configurations are loaded from the gitrepos directory
  2. Git repositories are loaded in the order in which they appear in datakit.conf
  3. For Pipelines, the first matching Pipeline file found is used. In the example above, when looking for nginx.p, if it is found in repo-1, repo-2 is not searched. If neither repository contains nginx.p, the top-level Pipeline directory is searched. Python scripts follow the same search mechanism.
Attention

After the remote Pipeline feature is enabled, the Pipeline synchronized from the center is loaded first.

After enabling Git synchronization, the original collector configurations in the conf.d directory will no longer be effective. In addition, the main configuration datakit.conf cannot be managed through Git.
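To make the search order concrete, in the example directory layout shown earlier DataKit would resolve nginx.p roughly as follows:

# Illustrative lookup order for nginx.p (repositories in datakit.conf order):
# 1. gitrepos/repo-1/pipeline/nginx.p
# 2. gitrepos/repo-2/pipeline/nginx.p
# 3. pipeline/nginx.p   (top-level Pipeline directory, used only if not found above)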
