Best Practices for Host Observability (Linux)
Basic Overview
Linux, more fully GNU/Linux, is a free and freely distributed Unix-like operating system. It is the most widely used operating system in enterprises, so its stability is crucial. Drawing on years of customer experience, Guance provides comprehensive host observability that helps customers quickly understand how their infrastructure is running and reduces maintenance costs.
Scenario Overview
[ Guance - Scenario - Dashboard - Create Dashboard - Host Overview_Linux ]
Prerequisites
Register an account on the official Guance website, then log in with your registered credentials.
Deployment Implementation
One-click Installation
DataKit is the official data collection application released by Guance, supporting the collection of hundreds of types of data.
Log in to the Guance console, click "Integration" - "DataKit", copy the installation command, and run it on the server you want to monitor.
Default Paths
Directory | Path |
---|---|
Installation Directory | /usr/local/datakit/ |
Log Directory | /var/log/datakit/ |
Main Configuration File | /usr/local/datakit/conf.d/datakit.conf |
Plugin Configuration Directory | /usr/local/datakit/conf.d/ |
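After the installer finishes, checking that the paths in the table exist can catch a failed or partial installation. The following is a minimal sketch; the function name `check_datakit_layout` and its optional prefix argument are illustrative and not part of DataKit.

```shell
# check_datakit_layout [PREFIX]
# Report whether each default DataKit path from the table above exists.
# PREFIX is an optional (hypothetical) root prefix, e.g. a chroot or a test tree.
# Returns non-zero if at least one path is missing.
check_datakit_layout() {
  prefix="${1:-}"
  rc=0
  for p in \
    /usr/local/datakit \
    /var/log/datakit \
    /usr/local/datakit/conf.d \
    /usr/local/datakit/conf.d/datakit.conf
  do
    if [ -e "${prefix}${p}" ]; then
      echo "ok      ${p}"
    else
      echo "missing ${p}"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling `check_datakit_layout` with no argument inspects the live system and prints one `ok` or `missing` line per path.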
Default Plugins
After installation, some data collection plugins are enabled by default. They are listed in the main configuration file datakit.conf:
default_enabled_inputs = ["cpu", "disk", "diskio", "mem", "swap", "hostobject", "net", "host_processes", "container", "system"]
Plugin Description:
Metric data can be viewed in [ Guance - Metrics ], and object data can be viewed directly on the relevant pages.
Plugin Name | Description | Data Type |
---|---|---|
cpu | Collects CPU usage information | Metrics |
disk | Collects disk usage information | Metrics |
diskio | Collects disk IO information | Metrics |
mem | Collects memory usage information | Metrics |
swap | Collects Swap memory usage information | Metrics |
system | Collects host OS load information | Metrics |
net | Collects network traffic information | Metrics |
host_processes | Collects the list of processes that have been running for more than 10 minutes | Object |
hostobject | Collects basic host information (such as OS information, hardware information, etc.) | Object |
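To see which collectors are enabled on a particular host, the `default_enabled_inputs` line can be read straight from datakit.conf. This is a rough sketch that assumes the setting sits on a single line, as in the example above; the helper name `list_default_inputs` is made up for illustration.

```shell
# list_default_inputs FILE
# Print each collector named in default_enabled_inputs, one per line.
# Assumes the whole array is written on a single line.
list_default_inputs() {
  grep '^[[:space:]]*default_enabled_inputs' "$1" \
    | sed -e 's/^[^[]*\[//' -e 's/\].*$//' \
    | tr ',' '\n' \
    | tr -d ' "'
}

# Only query the live configuration when it is readable on this machine.
conf=/usr/local/datakit/conf.d/datakit.conf
if [ -r "$conf" ]; then
  list_default_inputs "$conf"
fi
```

Comparing the output with the table makes it easy to spot a collector that was removed from the default set.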
Data Collection
When viewing metrics in Guance, you can filter them quickly by tag.
Default Collection
CPU Metrics
[ Guance - Metrics - cpu, view CPU status data ]
[ Guance - Metrics - system, view CPU load and core count data ]
Memory Metrics
[ Guance - Metrics - mem, view memory data ]
[ Guance - Metrics - swap, view swap memory data ]
Disk Metrics
[ Guance - Metrics - disk, view disk data ]
[ Guance - Metrics - diskio, view disk IO data ]
Network Metrics
[ Guance - Metrics - net, view network data ]
Host Objects
[ Guance - Infrastructure - Host, view all host object lists ]
[ Guance - Infrastructure - Host - Click any host to view basic system information ]
The integration status lists the plugins running on that host.
Process Objects
[ Guance - Infrastructure - Process, view all process object lists ]
[ Guance - Infrastructure - Process - Click any process name to view related information ]
Advanced Collection
In addition to the default metric and object data, DataKit can extend OS monitoring with other plugins.
Process List
To get real-time process information from all hosts, enable the process plugin (a global top feature).
- Enter the plugin configuration directory and copy the sample file
cd /usr/local/datakit/conf.d/host/
cp host_processes.conf.sample host_processes.conf
vi host_processes.conf
- Enable the process plugin
- Restart DataKit
[ Guance - Metrics - host_processes, view process data ]
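The "Enable the process plugin" step comes down to switching on metric collection in the copied host_processes.conf. A minimal sketch follows, assuming the sample file contains an `open_metric` key that defaults to `false`; check your copy of the sample, since key names can differ between DataKit versions. The helper name `enable_process_metrics` is illustrative.

```shell
# enable_process_metrics FILE
# Set open_metric = true in FILE, uncommenting the key if it is commented out.
# The original file is kept as FILE.bak.
# Assumption: the collector reads a boolean key named open_metric.
enable_process_metrics() {
  sed -i.bak \
    's/^\([[:space:]]*\)#*[[:space:]]*open_metric[[:space:]]*=.*/\1open_metric = true/' \
    "$1"
}
```

A typical call would be `enable_process_metrics /usr/local/datakit/conf.d/host/host_processes.conf`, followed by a DataKit restart (for example `systemctl restart datakit` on systemd hosts).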
Network Interface Metrics
Use eBPF to collect TCP/UDP connection information from the host's network interfaces.
- Install the ebpf plugin
- Enter the plugin configuration directory and copy the sample file
- Enable the ebpf plugin
[[inputs.ebpf]]
daemon = true
name = 'ebpf'
cmd = "/usr/local/datakit/externals/datakit-ebpf"
args = ["--datakit-apiserver", "0.0.0.0:9529"]
enabled_plugins = ["ebpf-net"]
- Restart DataKit
[ Guance - Infrastructure - Host - Click the host with the ebpf plugin installed - Network, view system network interface information ]
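The install, copy, and restart steps above can be sketched as a pair of shell functions. Several things here are assumptions: that `datakit install --ebpf` and `datakit service -R` are available in your DataKit version, and that the sample file is `conf.d/host/ebpf.conf.sample`. The function names are invented for this example.

```shell
# enable_from_sample DIR NAME
# Create DIR/NAME.conf from DIR/NAME.conf.sample, leaving an existing file untouched.
enable_from_sample() {
  if [ ! -f "$1/$2.conf" ]; then
    cp "$1/$2.conf.sample" "$1/$2.conf"
  fi
}

# setup_ebpf: install the collector, copy its sample config, restart DataKit.
# Run as root. The two datakit subcommands are assumptions; consult
# "datakit --help" if your version rejects them.
setup_ebpf() {
  datakit install --ebpf &&
    enable_from_sample /usr/local/datakit/conf.d/host ebpf &&
    datakit service -R
}
```

Because `enable_from_sample` never overwrites an existing file, rerunning `setup_ebpf` keeps any edits already made to ebpf.conf.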
Security Check
Perform real-time detection of security vulnerabilities on the host operating system.
- Install the Scheck service
See the Scheck installation instructions. Default paths after installation:
Directory | Path |
---|---|
Installation Directory | /usr/local/scheck |
Log Directory | /usr/local/scheck/log |
Main Configuration File | /usr/local/scheck/scheck.conf |
Detection Rule Directory | /usr/local/scheck/rules.d |
- Modify the main configuration file
rule_dir='/usr/local/scheck/rules.d'
output='http://127.0.0.1:9529/v1/write/security'
log='/usr/local/scheck/log'
log_level='info'
- Start the service
[ Guance - Security Check - Explorer, view all security events ]
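Before starting the service, it can help to confirm that `output` in scheck.conf points at the local DataKit. The sketch below reads a quoted value from the file and then starts the service; it assumes the installer registers a systemd unit named `scheck`, and both function names are hypothetical.

```shell
# conf_value FILE KEY
# Print the value of a KEY='value' (or KEY="value") line in FILE, without quotes.
conf_value() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*['\"]\(.*\)['\"][[:space:]]*\$/\1/p" "$1"
}

# start_scheck: show where events will be sent, then start the service.
# Run as root. Assumption: a systemd unit called "scheck" exists.
start_scheck() {
  echo "Scheck reports to: $(conf_value /usr/local/scheck/scheck.conf output)"
  systemctl start scheck
}
```

If the printed URL is not the DataKit address (port 9529 in the configuration above), fix scheck.conf before starting the service.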
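Extended Collection
In addition to its own data collection, DataKit is fully compatible with the Telegraf collector.
Install Telegraf. The steps below use CentOS as an example; for other systems, refer to the official Telegraf documentation.
- Add the yum repository (InfluxData rotated its package signing key in 2023; if the key URL below is rejected, take the current one from the official Telegraf documentation)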
cat <<EOF | tee /etc/yum.repos.d/influxdb.repo
[influxdb]
name = InfluxDB Repository - RHEL \$releasever
baseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
EOF
- Install the Telegraf collector
- Modify the main configuration file telegraf.conf
- Disable influxdb, enable outputs.http (to upload data to DataKit)
- Disable default Telegraf collections
#[[inputs.cpu]]
# percpu = true
# totalcpu = true
# collect_cpu_time = false
# report_active = false
#[[inputs.disk]]
# ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
#[[inputs.diskio]]
#[[inputs.mem]]
#[[inputs.processes]]
#[[inputs.swap]]
#[[inputs.system]]
- Start Telegraf
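The install, output, and start steps in this list might look like the following. This is a sketch under assumptions: DataKit is taken to listen on 127.0.0.1:9529 and to accept line protocol on `/v1/write/metric`, and the function names are illustrative rather than part of Telegraf or DataKit.

```shell
# datakit_output_conf
# Print an [[outputs.http]] section that posts Telegraf metrics to a local DataKit.
# Assumption: the DataKit write endpoint is /v1/write/metric on port 9529.
datakit_output_conf() {
  cat <<'EOF'
[[outputs.http]]
  url = "http://127.0.0.1:9529/v1/write/metric?input=telegraf"
  method = "POST"
  data_format = "influx"
EOF
}

# install_and_start_telegraf: package and service steps on CentOS. Run as root.
install_and_start_telegraf() {
  yum -y install telegraf &&
    systemctl start telegraf
}

# Show the section so it can be reviewed before it is added to telegraf.conf.
datakit_output_conf
```

Running `datakit_output_conf >> /etc/telegraf/telegraf.conf` appends the section. The `[[outputs.influxdb]]` header and its keys are commented out by hand, in the same way as the default inputs listed above.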
Port Metrics
Monitor important ports on the host.
- Modify the main configuration file telegraf.conf
- Enable port monitoring
[[inputs.net_response]]
protocol = "tcp"
address = "localhost:9090"
timeout = "3s"
[[inputs.net_response]]
protocol = "tcp"
address = "localhost:22"
timeout = "3s"
- Restart Telegraf
[ Guance - Metrics - net_response, view port data ]
Process Metrics
Monitor important processes on the host.
- Modify the main configuration file telegraf.conf
- Enable process monitoring
- Restart Telegraf
[ Guance - Metrics - procstat, view process data ]
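For the "Enable process monitoring" step, Telegraf's procstat input takes one section per process to watch. The helper below prints such sections using the `pattern` option; the helper name and the process names `nginx` and `sshd` are placeholders, not values from the original configuration.

```shell
# procstat_conf NAME...
# Print one [[inputs.procstat]] section per process name.
# "pattern" selects processes whose command line matches the given string.
procstat_conf() {
  for name in "$@"; do
    printf '[[inputs.procstat]]\n  pattern = "%s"\n' "$name"
  done
}

# Example with placeholder process names; replace them with your own.
procstat_conf nginx sshd
```

The printed sections go into telegraf.conf. Other procstat selectors such as `exe` or `pid_file` can be used in place of `pattern` when a command-line match is too broad.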
Single-point Dial Testing
Use this host as a single dial test point to monitor important interfaces and sites.
For multi-point dial testing, see Synthetic Tests.
- Modify the main configuration file telegraf.conf
- Enable HTTP monitoring
[[inputs.http_response]]
urls = ["https://www.baidu.com","https://guance.com","http://localhost:9090"]
- Restart Telegraf
[ Guance - Metrics - http_response, view dial test data ]
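Before restarting Telegraf, a dry run can confirm that a newly enabled input parses and returns data. The wrapper below uses Telegraf's `--test` and `--input-filter` flags; the function name and the `TELEGRAF_CONF` variable are conveniences made up for this sketch.

```shell
# telegraf_dry_run PLUGIN
# Gather once from the named input and print the metrics to stdout
# without sending them to any output.
# TELEGRAF_CONF (hypothetical) overrides the default config path.
telegraf_dry_run() {
  telegraf --config "${TELEGRAF_CONF:-/etc/telegraf/telegraf.conf}" \
    --input-filter "$1" --test
}
```

For example, `telegraf_dry_run http_response` should print one line per configured URL, and the same check works for `net_response` and `procstat`.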
Monitoring Rules
Set up alert rules and notification targets to track system stability in real time.
Built-in Templates
Guance ships with built-in monitoring library templates that can be used directly.
[ Guance - Monitoring - Create from Template - Host Monitoring Library ]
[ Guance - Monitoring - Create from Template - Ping Status Monitoring Library ]
[ Guance - Monitoring - Create from Template - Port Monitoring Library ]
Custom Monitoring Libraries
Add custom monitoring rules. Guance supports multiple monitor types, such as threshold, process, log, and network monitoring.
Threshold Monitoring
[ Guance - Monitoring - Create Monitor - Threshold Monitoring ]
Monitored metric: the alert rule expression.
Trigger Conditions: the threshold range. An alert is triggered when the conditions are met; once triggered, it recovers when the conditions are no longer met (specify the check cycle under Normal).
Event names and content can reference variables. Event content is written in Markdown (for example, a line break is two trailing spaces).
Notification Targets
Customize alarm rule notification targets
[ Guance - Manage - Notification Targets ]
Group monitors and add notification targets
[ Guance - Monitoring - Monitors - Grouping - Alert Configuration ]