Best Practices for Host Observability (Linux)


Basic Overview

Linux (in full, GNU/Linux) is a free, freely distributed Unix-like operating system. Because it is so widely used in enterprises, its stability is crucial. Drawing on years of customer experience, Guance provides comprehensive host observability, helping customers quickly gain insight into how their infrastructure is running and significantly reduce maintenance costs.

Scenario Overview

< Guance - Scenario - Dashboard - Create Dashboard - Host Overview_Linux >

Prerequisites

Visit the official Guance website, register an account, and log in with your registered credentials.

Deployment Implementation

One-click Installation

DataKit is the official data collection application released by Guance, supporting the collection of hundreds of types of data.

Log in to the Guance console, click on "Integration" - "DataKit", copy the command line, and run it directly on the server.

Default Paths

| Directory | Path |
| --- | --- |
| Installation Directory | /usr/local/datakit/ |
| Log Directory | /var/log/datakit/ |
| Main Configuration File | /usr/local/datakit/conf.d/datakit.conf |
| Plugin Configuration Directory | /usr/local/datakit/conf.d/ |

Default Plugins

After installation, some data collection plugins are enabled by default. They are listed in the main configuration file, datakit.conf:

default_enabled_inputs = ["cpu", "disk", "diskio", "mem", "swap", "hostobject", "net", "host_processes", "container", "system"]

Plugin Description:

Metric data can be viewed in [ Guance - Metrics ], and object data can be viewed directly on the relevant pages.

| Plugin Name | Description | Data Type |
| --- | --- | --- |
| cpu | Collects CPU usage information | Metrics |
| disk | Collects disk usage information | Metrics |
| diskio | Collects disk IO information | Metrics |
| mem | Collects memory usage information | Metrics |
| swap | Collects Swap memory usage information | Metrics |
| system | Collects host OS load information | Metrics |
| net | Collects network traffic information | Metrics |
| host_processes | Collects long-running (more than 10 minutes) process lists | Object |
| hostobject | Collects basic host information (such as OS information, hardware information, etc.) | Object |

Data Collection

When viewing metrics using Guance, you can use tags for quick condition filtering.

Default Collection

CPU Metrics

[ Guance - Metrics - cpu, view CPU status data ]
[ Guance - Metrics - system, view CPU load and core count data ]

Memory Metrics

[ Guance - Metrics - mem, view memory data ]
[ Guance - Metrics - swap, view swap memory data ]

Disk Metrics

[ Guance - Metrics - disk, view disk data ]
[ Guance - Metrics - diskio, view disk IO data ]

Network Metrics

[ Guance - Metrics - net, view network data ]

Host Objects

[ Guance - Infrastructure - Host, view all host object lists ]

[ Guance - Infrastructure - Host - Click any host to view basic system information ]

The integration status shows the list of plugins running on that host.

Process Objects

[ Guance - Infrastructure - Process, view all process object lists ]

[ Guance - Infrastructure - Process - Click any process name to view related information ]

Advanced Collection

In addition to the default metrics/object data, DataKit can enhance OS monitoring data with other plugins.

Process List

To get a real-time process list from all hosts (a global top view), enable the process plugin.

  1. Enter the plugin configuration directory and copy the sample file

cd /usr/local/datakit/conf.d/host/
cp host_processes.conf.sample host_processes.conf
vi host_processes.conf

  2. Enable the process plugin

[[inputs.host_processes]]
  min_run_time = "10m"
  open_metric = true

  3. Restart DataKit

systemctl restart datakit

[ Guance - Metrics - host_processes, view process data ]
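
The min_run_time = "10m" setting matches the plugin table's description of host_processes as collecting processes that have run for more than 10 minutes. The sketch below only illustrates that kind of filter and is not DataKit's implementation. The function names, the supported duration units (h, m, s), and treating exactly 10 minutes as long enough are all assumptions.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Convert a duration such as "10m" or "1h30m" to whole seconds."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject input that contains anything besides <number><unit> pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def long_running(processes: dict[str, float], now: float,
                 min_run_time: str = "10m") -> list[str]:
    """Return the names of processes started at least min_run_time before now.

    `processes` maps a process name to its start time in seconds.
    """
    threshold = parse_duration(min_run_time)
    return sorted(name for name, started in processes.items()
                  if now - started >= threshold)
```

For example, with now=1000.0 a process started at 0.0 has run for 1000 seconds and is kept, while one started at 900.0 has run for 100 seconds and is dropped.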

Network Interface Metrics

Use eBPF to collect TCP/UDP connection information from the host's network interfaces.

  1. Install the ebpf plugin

datakit install --datakit-ebpf

  2. Enter the plugin configuration directory and copy the sample file

cd /usr/local/datakit/conf.d/host
cp ebpf.conf.sample ebpf.conf
vi ebpf.conf

  3. Enable the ebpf plugin

[[inputs.ebpf]]
  daemon = true
  name = 'ebpf'
  cmd = "/usr/local/datakit/externals/datakit-ebpf"
  args = ["--datakit-apiserver", "0.0.0.0:9529"]
  enabled_plugins = ["ebpf-net"]

  4. Restart DataKit

systemctl restart datakit

[ Guance - Infrastructure - Host - Click the host with the ebpf plugin installed - Network, view system network interface information ]

Security Check

Perform real-time detection of security vulnerabilities on the host operating system

  1. Install the Scheck service

bash -c "$(curl https://static.dataflux.cn/security-checker/install.sh)"

Installation Instructions

| Directory | Path |
| --- | --- |
| Installation Directory | /usr/local/scheck |
| Log Directory | /usr/local/scheck/log |
| Main Configuration File | /usr/local/scheck/scheck.conf |
| Detection Rule Directory | /usr/local/scheck/rules.d |

  2. Modify the main configuration file

rule_dir='/usr/local/scheck/rules.d'
output='http://127.0.0.1:9529/v1/write/security'
log='/usr/local/scheck/log'
log_level='info'

  3. Start the service

systemctl start scheck

[ Guance - Security Check - Explorer, view all security events ]

Extended Collection

In addition to its own data collection, DataKit is fully compatible with the Telegraf collector.

Install Telegraf. The steps below use CentOS as an example; for other systems, refer to the official Telegraf documentation.

  1. Add yum repository

cat <<EOF | tee /etc/yum.repos.d/influxdb.repo
[influxdb]
name = InfluxDB Repository - RHEL \$releasever
baseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
EOF

  2. Install the Telegraf collector

yum -y install telegraf

  3. Modify the main configuration file telegraf.conf

vi /etc/telegraf/telegraf.conf

  4. Disable influxdb, enable outputs.http (to upload data to DataKit)

#[[outputs.influxdb]]
[[outputs.http]]
url = "http://127.0.0.1:9529/v1/write/metric?input=telegraf"

  5. Disable default Telegraf collections

#[[inputs.cpu]]
#  percpu = true
#  totalcpu = true
#  collect_cpu_time = false
#  report_active = false
#[[inputs.disk]]
#  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
#[[inputs.diskio]]
#[[inputs.mem]]
#[[inputs.processes]]
#[[inputs.swap]]
#[[inputs.system]]

  6. Start Telegraf

systemctl start telegraf
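
Unless its data_format option is changed, Telegraf's outputs.http plugin serializes metrics as InfluxDB line protocol and sends them with HTTP POST, so that is what arrives at the DataKit URL configured above. The sketch below shows roughly what one such request looks like. It is a simplified illustration, not Telegraf's serializer, and the helper names (to_line, post_lines) are hypothetical.

```python
import urllib.request

# URL copied from the outputs.http block above.
DATAKIT_URL = "http://127.0.0.1:9529/v1/write/metric?input=telegraf"


def _escape(text: str, chars: str) -> str:
    """Backslash-escape the characters line protocol treats as separators."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field(value) -> str:
    """Render one field value using line protocol's type markers."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, ts_ns=None) -> str:
    """Build one line: measurement,tag=value field=value [timestamp]."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(f"{_escape(k, ',= ')}={_field(v)}" for k, v in fields.items())
    line = f"{head} {body}"
    return line if ts_ns is None else f"{line} {ts_ns}"


def post_lines(lines: list, url: str = DATAKIT_URL, timeout: float = 3.0) -> int:
    """POST newline-separated lines to DataKit and return the HTTP status."""
    data = "\n".join(lines).encode("utf-8")
    request = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling post_lines([to_line("demo", {"host": "web01"}, {"value": 1})]) on a host with DataKit running is one way to confirm that the endpoint accepts data before pointing Telegraf at it. The measurement and tag names in that call are made up for the example.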

Port Metrics

Monitor important ports in the operating system

  1. Modify the main configuration file telegraf.conf

vi /etc/telegraf/telegraf.conf

  2. Enable port monitoring

[[inputs.net_response]]
  protocol = "tcp"
  address = "localhost:9090"
  timeout = "3s"
[[inputs.net_response]]
  protocol = "tcp"
  address = "localhost:22"
  timeout = "3s"

  3. Restart Telegraf

systemctl restart telegraf

[ Guance - Metrics - net_response, view port data ]
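
Conceptually, each net_response block above attempts a TCP connection to the address and records whether it succeeded within the timeout and how long it took. The sketch below reproduces that idea with the standard library so a port can be tested by hand before it is added to telegraf.conf. It is not Telegraf's code; the function name check_tcp and the result labels are chosen here for illustration.

```python
import socket
import time


def check_tcp(host: str, port: int, timeout: float = 3.0) -> dict:
    """Try one TCP connection and report the outcome and elapsed time."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            result = "success"
    except socket.timeout:
        result = "timeout"
    except OSError:
        # Refused connections, unreachable hosts, DNS failures, and so on.
        result = "connection_failed"
    return {
        "address": f"{host}:{port}",
        "result": result,
        "response_time": time.monotonic() - start,
    }


# Example, using the two addresses from the configuration above:
#     for port in (9090, 22):
#         print(check_tcp("localhost", port))
```

A port that reports anything other than "success" here will most likely show up as a failure in the net_response metric as well.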

Process Metrics

Monitor important processes in the operating system

  1. Modify the main configuration file telegraf.conf

vi /etc/telegraf/telegraf.conf

  2. Enable process monitoring

[[inputs.procstat]]
  pattern = "zookeeper"
[[inputs.procstat]]
  pattern = "httpd"

  3. Restart Telegraf

systemctl restart telegraf

[ Guance - Metrics - procstat, view process data ]
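
The procstat pattern option selects processes by matching a regular expression against the full command line, in the manner of pgrep -f. Before adding a pattern, it can help to preview which processes it would select. The sketch below does this by reading /proc on Linux. It uses Python's re module, whose syntax differs in places from the regular expressions pgrep accepts, and the function names are hypothetical.

```python
import os
import re


def read_cmdlines(proc_root: str = "/proc") -> dict[int, str]:
    """Map each PID to its full command line, as exposed by Linux under /proc."""
    cmdlines = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # the process exited or is not readable
        # Arguments are NUL-separated; join them with spaces.
        cmdlines[int(entry)] = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
    return cmdlines


def match_processes(pattern: str, cmdlines: dict[int, str]) -> list[int]:
    """Return the PIDs whose command line contains a match for pattern."""
    regex = re.compile(pattern)
    return sorted(pid for pid, cmd in cmdlines.items() if regex.search(cmd))


# Example on a Linux host:
#     print(match_processes("zookeeper", read_cmdlines()))
```

A short pattern such as "httpd" also matches any other process that merely mentions httpd in its arguments, so a more specific pattern is often the safer choice.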

Single-point Dial Testing

Use this machine as a dial test point to monitor important interfaces/sites

For multi-point dial testing, see Synthetic Tests

  1. Modify the main configuration file telegraf.conf

vi /etc/telegraf/telegraf.conf

  2. Enable HTTP monitoring

[[inputs.http_response]]
  urls = ["https://www.baidu.com","https://guance.com","http://localhost:9090"]

  3. Restart Telegraf

systemctl restart telegraf

[ Guance - Metrics - http_response, view dial test data ]
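
A dial test of this kind boils down to sending a request to each URL and recording the status code and response time. The sketch below performs one such probe with the standard library, which is handy for confirming that a URL is reachable from this machine before adding it to the urls list. It is an illustration only, not the http_response plugin's implementation, and the function name probe is hypothetical.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> dict:
    """Send one GET request and report the status code and elapsed time."""
    start = time.monotonic()
    status_code = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status_code = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx or 5xx status.
        status_code = err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar errors.
        pass
    return {
        "url": url,
        "status_code": status_code,
        "reachable": status_code is not None,
        "response_time": time.monotonic() - start,
    }


# Example, using one of the URLs from the configuration above:
#     print(probe("http://localhost:9090"))
```

Note that urlopen follows redirects, so the reported status code is that of the final response.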

Monitoring Rules

Set up alert rules and notification targets to stay informed about system stability in real time.

Built-in Templates

Guance includes built-in monitoring library templates that can be used directly.

[ Guance - Monitoring - Create from Template - Host Monitoring Library ]
[ Guance - Monitoring - Create from Template - Ping Status Monitoring Library ]
[ Guance - Monitoring - Create from Template - Port Monitoring Library ]

Custom Monitoring Libraries

Add monitoring rules via customization; Guance supports multiple types of monitoring, such as thresholds, processes, logs, and network monitoring.

Threshold Monitoring

[ Guance - Monitoring - Create Monitor - Threshold Monitoring ]

Monitored metric: the alarm rule expression, which specifies the data table, the monitored metric, and the tag (only tags in the by clause can be referenced in event content).

Trigger conditions: the final threshold range. An alarm is triggered when the conditions are met; once they are no longer met, the alarm can recover (specify the check cycle under Normal).

Event names and content can reference variables. Event content uses Markdown text format (e.g., a line break is written as two trailing spaces).

Notification Targets

Customize alarm rule notification targets

[ Guance - Manage - Notification Targets ]

Group monitors and add notification targets

[ Guance - Monitoring - Monitors - Grouping - Alert Configuration ]
