
Host Observability Best Practices (Linux)


Basic Overview

Linux (more fully, GNU/Linux) is a free-to-use, freely distributable Unix-like operating system. Because it is so widely used in enterprises, its stability is critical. Guance provides full coverage of host observability, built on years of customer experience, helping customers quickly understand how their infrastructure is running and reducing maintenance costs.

Scene Overview

< Guance - Scene - Dashboard - Create Dashboard - Host Overview_Linux >


Prerequisites

Register an account on the official Guance website, then log in with the registered account and password.

Deployment Implementation

One-Click Installation

DataKit is the official data collection application released by Guance, supporting the collection of hundreds of types of data.

Log in to the Guance console, click "Integration" - "DataKit", copy the installation command, and run it directly on the server.


Default Paths

Directory | Path
Installation Directory | /usr/local/datakit/
Log Directory | /var/log/datakit/
Main Configuration File | /usr/local/datakit/conf.d/datakit.conf
Plugin Configuration Directory | /usr/local/datakit/conf.d/

Default Plugins

After installation, some plugins (data collection) will be enabled by default. These can be viewed in the main configuration file datakit.conf.

default_enabled_inputs = ["cpu", "disk", "diskio", "mem", "swap", "hostobject", "net", "host_processes", "container", "system"]

Plugin Description:

Metric data can be viewed in [ Guance - Metrics ], and object data can be viewed directly on relevant pages.

Plugin Name | Description | Data Type
cpu | Collects CPU usage information from the host | Metrics
disk | Collects disk usage information | Metrics
diskio | Collects disk IO usage information from the host | Metrics
mem | Collects memory usage information from the host | Metrics
swap | Collects Swap memory usage information | Metrics
system | Collects operating system load information from the host | Metrics
net | Collects network traffic information from the host | Metrics
host_processes | Collects a list of resident processes (alive for over 10 minutes) on the host | Objects
hostobject | Collects basic host information (such as OS information, hardware information, etc.) | Objects
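
To confirm which plugins are enabled on a particular host, the default_enabled_inputs line can be read straight from the main configuration file. The following is a minimal sketch, assuming the list sits on a single line as in the example above; list_default_inputs is a hypothetical helper name, not a DataKit command.

```shell
# Print each default-enabled input from a datakit.conf-style file, one per line.
# Assumption: the whole list is on a single line, as in the example above.
list_default_inputs() {
  conf="${1:-/usr/local/datakit/conf.d/datakit.conf}"
  grep -E '^[[:space:]]*default_enabled_inputs[[:space:]]*=' "$conf" \
    | head -n 1 \
    | sed -E 's/^[^[]*\[//; s/\].*$//' \
    | tr ',' '\n' \
    | sed -E 's/^[[:space:]]*"//; s/"[[:space:]]*$//'
}
```

Calling list_default_inputs with no argument reads the default main configuration path; passing a path reads that file instead.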

Data Collection

When viewing metrics with Guance, you can use tags for quick condition filtering.

Default Collection

CPU Metrics

[ Guance - Metrics - cpu, view CPU status data ] [ Guance - Metrics - system, view CPU load and core count data ]


Memory Metrics

[ Guance - Metrics - mem, view memory data ] [ Guance - Metrics - swap, view memory swap data ]


Disk Metrics

[ Guance - Metrics - disk, view disk data ] [ Guance - Metrics - diskio, view disk IO data ]


Network Metrics

[ Guance - Metrics - net, view network data ]


Host Objects

[ Guance - Infrastructure - Host, view all host object lists ]


[ Guance - Infrastructure - Host - Click any host to view basic system information ]

The integration runtime status lists the plugins already running on this server.


Process Objects

[ Guance - Infrastructure - Process, view all process object lists ]


[ Guance - Infrastructure - Process - Click any process name to view related process information ]


Advanced Collection

In addition to the default metric and object data, DataKit can supplement operating system monitoring data through other plugins.

Process List

To view a real-time process list across all hosts (similar to a global top), enable the process plugin.

  1. Enter the plugin configuration directory and copy the sample file
cd /usr/local/datakit/conf.d/host/
cp host_processes.conf.sample host_processes.conf
vi host_processes.conf
  2. Enable the process plugin
[[inputs.host_processes]]
  min_run_time = "10m"
  open_metric = true
  3. Restart DataKit
systemctl restart datakit

[ Guance - Metrics - host_processes, view process data ]


Network Interface Metrics

Use eBPF to collect TCP/UDP connection information on the host's network interfaces.

  1. Install the ebpf plugin
datakit install --datakit-ebpf
  2. Enter the plugin configuration directory and copy the sample file
cd /usr/local/datakit/conf.d/host
cp ebpf.conf.sample ebpf.conf
vi ebpf.conf
  3. Enable the ebpf plugin
[[inputs.ebpf]]
  daemon = true
  name = 'ebpf'
  cmd = "/usr/local/datakit/externals/datakit-ebpf"
  args = ["--datakit-apiserver", "0.0.0.0:9529"]
  enabled_plugins = ["ebpf-net"]
  4. Restart DataKit
systemctl restart datakit

[ Guance - Infrastructure - Host - Click on the host where the ebpf plugin is installed - Network, view system network interface information ]


Security Check

Perform real-time detection of security vulnerabilities on the host operating system.

  1. Install the Scheck service
bash -c "$(curl https://static.dataflux.cn/security-checker/install.sh)"

Installation Instructions

Directory | Path
Installation Directory | /usr/local/scheck
Log Directory | /usr/local/scheck/log
Main Configuration File | /usr/local/scheck/scheck.conf
Detection Rule Directory | /usr/local/scheck/rules.d

  2. Modify the main configuration file
rule_dir='/usr/local/scheck/rules.d'
output='http://127.0.0.1:9529/v1/write/security'
log='/usr/local/scheck/log'
log_level='info'
  3. Start the service
systemctl start scheck

[ Guance - Security Check - Explorer, view all security events ]


Extended Collection

In addition to its own data collection, DataKit is fully compatible with the Telegraf collector.

Install Telegraf. CentOS is used as the example here; for other systems, refer to the official Telegraf documentation.

  1. Add yum source
cat <<EOF | tee /etc/yum.repos.d/influxdb.repo
[influxdb]
name = InfluxDB Repository - RHEL \$releasever
baseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
EOF
  2. Install the telegraf collector
yum -y install telegraf
  3. Modify the main configuration file telegraf.conf
vi /etc/telegraf/telegraf.conf
  4. Disable influxdb, enable outputs.http (to upload data to datakit)
#[[outputs.influxdb]]
[[outputs.http]]
url = "http://127.0.0.1:9529/v1/write/metric?input=telegraf"
  5. Disable telegraf default collection
#[[inputs.cpu]]
#  percpu = true
#  totalcpu = true
#  collect_cpu_time = false
#  report_active = false
#[[inputs.disk]]
#  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
#[[inputs.diskio]]
#[[inputs.mem]]
#[[inputs.processes]]
#[[inputs.swap]]
#[[inputs.system]]
  6. Start telegraf
systemctl start telegraf
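
Before relying on Telegraf, it can help to confirm that the local DataKit endpoint accepts data at all. The sketch below builds one point in InfluxDB line protocol and posts it to the URL configured in outputs.http above. It assumes curl is installed and that the endpoint accepts line protocol in the request body; build_point, send_point, and the measurement name smoke_test are hypothetical names used only for illustration.

```shell
# Build one line-protocol point: measurement,tag=value field=value timestamp(ns).
# The tag key "host" and field key "value" are illustrative choices.
build_point() {
  measurement="$1"; host_tag="$2"; value="$3"; ts_ns="$4"
  printf '%s,host=%s value=%s %s\n' "$measurement" "$host_tag" "$value" "$ts_ns"
}

# POST the point to the local DataKit endpoint and print the HTTP status code.
# Assumption: DataKit listens on 127.0.0.1:9529, as in the configuration above.
send_point() {
  build_point "$@" | curl -sS -o /dev/null -w '%{http_code}\n' \
    -X POST --data-binary @- \
    'http://127.0.0.1:9529/v1/write/metric?input=telegraf'
}
```

A call such as send_point smoke_test "$(hostname)" 1 "$(date +%s)000000000" sends a single test value stamped with the current time in nanoseconds.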

Port Metrics

Detect important ports in the operating system.

  1. Modify the main configuration file telegraf.conf
vi /etc/telegraf/telegraf.conf
  2. Enable port detection
[[inputs.net_response]]
  protocol = "tcp"
  address = "localhost:9090"
  timeout = "3s"
[[inputs.net_response]]
  protocol = "tcp"
  address = "localhost:22"
  timeout = "3s"
  3. Restart telegraf
systemctl restart telegraf
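
To check by hand what a TCP probe like the ones configured above should report, a port can be tested from the shell. This is a rough sketch, assuming bash and the GNU coreutils timeout command are available; check_tcp_port is a hypothetical helper, not part of Telegraf, and it only reports whether a connection could be opened.

```shell
# Report whether a TCP port accepts a connection within a timeout (seconds).
# Uses bash's /dev/tcp pseudo-device; any failure or timeout counts as "closed".
check_tcp_port() {
  host="$1"; port="$2"; timeout_s="${3:-3}"
  if timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
```

For the two addresses in the configuration, check_tcp_port localhost 9090 and check_tcp_port localhost 22 give a quick manual comparison against the net_response data.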

[ Guance - Metrics - net_response, view port data ]


Process Metrics

Detect important processes in the operating system.

  1. Modify the main configuration file telegraf.conf
vi /etc/telegraf/telegraf.conf
  2. Enable process detection
[[inputs.procstat]]
    pattern = "zookeeper"
[[inputs.procstat]]
    pattern = "httpd"
  3. Restart telegraf
systemctl restart telegraf

[ Guance - Metrics - procstat, view process data ]


Single-point Testing

Using the local machine as a testing point, detect important interfaces/sites.

For multi-point testing, see Synthetic Tests.

  1. Modify the main configuration file telegraf.conf
vi /etc/telegraf/telegraf.conf
  2. Enable HTTP detection
[[inputs.http_response]]
    urls = ["https://www.baidu.com","https://guance.com","http://localhost:9090"]
  3. Restart telegraf
systemctl restart telegraf

[ Guance - Metrics - http_response, view test data ]


Monitoring Rules

Monitoring rules set alarm conditions and notification targets so that system stability can be tracked in real time.

Built-in Templates

Guance already includes some built-in detection library templates that can be used directly.

[ Guance - Monitoring - Create from Template - Host Detection Library ] [ Guance - Monitoring - Create from Template - Ping Status Detection Library ] [ Guance - Monitoring - Create from Template - Port Detection Library ]

Custom Detection Libraries

Add your own detection rules. Guance supports several detection types, such as threshold, process, log, and network detection.

Threshold Detection

[ Guance - Monitoring - Create Monitor - Threshold Detection ]

Detection Metric: the alarm rule expression, which names the data table, the monitored metric, and the tags. Only tags listed in the by conditions can be referenced in the event content.


Trigger Condition: the final threshold range. An alarm is triggered when the condition is met. If a later check no longer meets the threshold, the event can recover (for recovery to work, the detection cycle for the normal condition must be filled in).
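
The trigger-and-recover behaviour described above can be illustrated with a small state function. This is a simplified sketch, not Guance's implementation: it recovers after a single check below the threshold, whereas the real monitor uses the configured detection cycle, and the state names and the next_state helper are hypothetical.

```shell
# Decide the next monitor state from the previous state and a new sample.
# States ("normal", "alarm", "recovered") are illustrative, not Guance identifiers.
next_state() {
  state="$1"; value="$2"; threshold="$3"
  # awk performs the numeric comparison, so decimal values also work.
  if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v >= t) }'; then
    echo "alarm"
  elif [ "$state" = "alarm" ]; then
    echo "recovered"
  else
    echo "normal"
  fi
}
```

For example, with a threshold of 90, a sample of 95 moves a normal monitor to alarm, and a following sample of 50 moves it to recovered.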


The event name and content can reference variables. The event content uses Markdown text format (for example, a line break requires two trailing spaces).


Notification Targets

Customize settings for alarm rule notification targets.

[ Guance - Manage - Notification Targets ]


Group monitors, then configure notification targets for each group.

[ Guance - Monitoring - Monitors - Grouping - Alert Configuration ]

