SNMP¶
This document describes SNMP data collection.
Terminology¶
- SNMP (Simple Network Management Protocol): A network protocol used to collect information about bare-metal networking gear.
- OID (Object identifier): A unique ID or address on a device that, when polled, returns the value stored at that address. For example, CPU usage and device fan speed are exposed as OIDs.
- sysOID (System object identifier): A specific address that defines the device type. Every device has an ID that identifies it. For example, the Meraki base sysOID is 1.3.6.1.4.1.29671.
- MIB (Management Information Base): A database or list of all the possible OIDs, and their definitions, that belong to the MIB. For example, the IF-MIB (interface MIB) contains all the OIDs for descriptive information about a device's interfaces.
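About the SNMP protocol¶
SNMP comes in three versions: v1, v2c, and v3.
- v1 and v2c are compatible. Many SNMP devices offer only v2c and v3. v2c has the best compatibility, and many older devices support only this version.
- If security requirements are high, choose v3. Security is the main difference between v3 and the earlier versions.
Datakit supports all of these versions.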
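Choosing v1/v2c¶
If you choose v1/v2c, you need to provide a community string. It is effectively an unencrypted password: the SNMP device requires it for authentication before it will answer. Some devices further split it into a "read-only community string" and a "read-write community string":
- Read-only community string: the holder can only read the device's internal metrics and cannot change its configuration (this is all Datakit needs).
- Read-write community string: the holder can query the device's internal metrics and modify part of its configuration.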
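Choosing v3¶
If you choose v3, you need to provide the username, the authentication algorithm and password, the encryption algorithm and password, the context, and so on. Requirements differ between devices, so fill these in according to the device-side configuration.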
Configuration¶
Go to the conf.d/snmp directory under the DataKit installation directory, copy snmp.conf.sample, and rename the copy to snmp.conf. An example follows:
[[inputs.snmp]]
## Specific device IP addresses, for example ["10.200.10.240", "10.200.10.241"].
## auto_discovery and specific_devices can be used at the same time.
## Omit this if you do not want to target specific devices.
#
# specific_devices = ["***"] # SNMP Device IP.
## Autodiscovery CIDR subnets, for example ["10.200.10.0/24", "10.200.20.0/24"].
## Omit this if you do not want to enable autodiscovery.
#
# auto_discovery = ["***"] # Used in autodiscovery mode only, ignore this in other cases.
## SNMP protocol version used by the devices: 2 or 3.
## For version 1 devices, use 2; version 2 is compatible with version 1.
## Required.
#
snmp_version = 2
## SNMP port on the devices. Default is 161. In most cases you do not need to change this.
## Optional.
#
# port = 161
## Community string (password) for SNMP v2; enclose it in single quotes. Only used with SNMP v2.
## Required if you are using SNMP v2.
## Not needed if you are using SNMP v3.
#
# v2_community_string = "***"
## Authentication settings for SNMP v3.
## Not needed if you are using SNMP v2.
## Required if you are using SNMP v3.
#
# v3_user = "***"
# v3_auth_protocol = "***"
# v3_auth_key = "***"
# v3_priv_protocol = "***"
# v3_priv_key = "***"
# v3_context_engine_id = "***"
# v3_context_name = "***"
## Number of workers used to collect from and discover devices concurrently. Default is 100.
## Adjust it based on the number of devices and the network scale.
## Optional.
#
# workers = 100
## Interval between autodiscovery runs. Default is "1h".
## Only used by autodiscovery.
## Optional.
#
# discovery_interval = "1h"
## IP addresses to exclude, for example ["10.200.10.220", "10.200.10.221"].
## Only used by autodiscovery.
## Optional.
#
# discovery_ignored_ip = []
## Set to true to enable election.
#
# election = true
## Device namespace. Default is "default".
#
# device_namespace = "default"
## Collect only the metric fields named below.
#
# enable_picking_data = true # Default is "false", which means collecting all data.
# status = ["sysUpTimeInstance", "tcpCurrEstab", "ifAdminStatus", "ifOperStatus", "cswSwitchState"]
# speed = ["ifHCInOctets", "ifHCInOctetsRate", "ifHCOutOctets", "ifHCOutOctetsRate", "ifHighSpeed", "ifSpeed", "ifBandwidthInUsageRate", "ifBandwidthOutUsageRate"]
# cpu = ["cpuUsage"]
# mem = ["memoryUsed", "memoryUsage", "memoryFree"]
# extra = []
[inputs.snmp.tags]
# tag1 = "val1"
# tag2 = "val2"
[inputs.snmp.traps]
# enable = true
# bind_host = "0.0.0.0"
# port = 9162
# stop_timeout = 3 # stop timeout in seconds.
After configuring, restart DataKit.
The collector can currently also be enabled by injecting its configuration through a ConfigMap.
Tip
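Once the configuration is complete, you can use the datakit debug --input-conf command to test whether it is correct. If it is, line protocol data is printed; otherwise no line protocol data appears.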
Attention
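- If a key configured in inputs.snmp.tags above has the same name as a key in the original fields, it is overwritten by the original data.
- The device IP address (specific-device mode) or subnet (autodiscovery mode), the SNMP protocol version, and the corresponding authentication fields are required.
- Specific-device mode and autodiscovery mode can coexist, but all devices must then share the same SNMP protocol version and the same authentication fields.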
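The required-field rules above can be sketched as a small pre-flight check. This is only an illustration, not Datakit's own validation: `check_snmp_conf` is a hypothetical helper, it takes the `[[inputs.snmp]]` keys as a plain dict, and treating `v3_user` as the minimum for v3 is an assumption (the v3 fields a device actually needs depend on its configuration).

```python
def check_snmp_conf(conf):
    """Return a list of problems; an empty list means the basics are present.

    `conf` is a dict holding the keys of one [[inputs.snmp]] section.
    Hypothetical helper for illustration, not part of Datakit.
    """
    problems = []

    # At least one way of finding devices must be configured.
    if not conf.get("specific_devices") and not conf.get("auto_discovery"):
        problems.append("set specific_devices and/or auto_discovery")

    version = conf.get("snmp_version")
    if version == 2:
        # v1/v2c devices authenticate with the community string.
        if not conf.get("v2_community_string"):
            problems.append("snmp_version 2 needs v2_community_string")
    elif version == 3:
        # Assumption: a user name is the minimum; other v3_* keys vary by device.
        if not conf.get("v3_user"):
            problems.append("snmp_version 3 needs v3_user")
    else:
        problems.append("snmp_version must be 2 or 3")

    return problems


if __name__ == "__main__":
    print(check_snmp_conf({"auto_discovery": ["10.200.10.0/24"], "snmp_version": 3}))
```

Because one section carries a single version and a single set of credentials, devices that need different credentials cannot share it.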
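Configuring SNMP¶
- On the device side, configure the SNMP protocol
On most SNMP devices, the SNMP protocol is disabled by default and must be enabled manually in the management interface. You also need to choose the protocol version and fill in the corresponding information according to your environment.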
Tip
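For security reasons, some devices need extra configuration before they allow SNMP traffic; the details vary by device. On Huawei firewalls, for example, you need to check SNMP under "启用访问管理" (enable access management). You can use the snmpwalk command to test whether the collector side can reach the device side (run the following commands on the host where Datakit runs):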
# For v2c
snmpwalk -O bentU -v 2c -c [community string] [SNMP_DEVICE_IP] 1.3.6
# For v3
snmpwalk -v 3 -u user -l authPriv -a sha -A [AUTH_PASSWORD] -x aes -X [PRIV_PASSWORD] [SNMP_DEVICE_IP] 1.3.6
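If the configuration is correct, the command prints a large amount of data. snmpwalk is a test tool that runs on the collector side. It ships with macOS; on Linux it has to be installed first.
- On the DataKit side, configure collection.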
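Advanced features¶
Custom OID configuration for devices¶
If the data reported by a collected device lacks a metric you want, you may need to define an additional Profile for that device.
All OIDs for a device can usually be downloaded from the vendor's website. Datakit defines a set of generic OIDs, plus OIDs for some devices from Cisco, Dell, HP, and others. Under the SNMP protocol, each device vendor can define its own OIDs to identify its internal special objects. To collect these, you need to customize the device configuration (called a Profile here, that is, a "custom Profile"), as follows.
To add metrics or customize the configuration, list the MIB name, table name, table OID, symbol, and symbol OID, for example: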
- MIB: EXAMPLE-MIB
  table:
    # Identification of the table which metrics come from.
    OID: 1.3.6.1.4.1.10
    name: exampleTable
  symbols:
    # List of symbols ('columns') to retrieve.
    # Same format as for a single OID.
    # Each row in the table emits these metrics.
    - OID: 1.3.6.1.4.1.10.1.1
      name: exampleColumn1
A worked example follows.
Under conf.d/snmp/profiles in the Datakit installation directory, create a YAML file named cisco-3850.yaml as shown below (using the Cisco 3850 as an example):
# Backward compatibility shim. Prefer the Cisco Catalyst profile directly
# Profile for Cisco 3850 devices
extends:
  - _base.yaml
  - _cisco-generic.yaml
  - _cisco-catalyst.yaml
sysobjectid: 1.3.6.1.4.1.9.1.1745 # cat38xxstack
device:
  vendor: "cisco"
# Example sysDescr:
# Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software (CAT3K_CAA-UNIVERSALK9-M), Version 03.06.06E RELEASE SOFTWARE (fc1) Technical Support: http://www.cisco.com/techsupport Copyright (c) 1986-2016 by Cisco Systems, Inc. Compiled Sat 17-Dec-
metadata:
  device:
    fields:
      serial_number:
        symbol:
          MIB: OLD-CISCO-CHASSIS-MIB
          OID: 1.3.6.1.4.1.9.3.6.3.0
          name: info
metrics:
  # iLO controller metrics.
  - # Power state.
    # NOTE: unknown(1), poweredOff(2), poweredOn(3), insufficientPowerOrPowerOnDenied(4)
    MIB: CPQSM2-MIB
    symbol:
      OID: 1.3.6.1.4.1.232.9.2.2.32
      name: temperature
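The file above defines a device whose sysobjectid is 1.3.6.1.4.1.9.1.1745. The next time Datakit collects from a device with the same sysobjectid, it applies this file. In that case:
- When data with OID 1.3.6.1.4.1.9.3.6.3.0 is collected, a field named serial_number is added to the device_meta field (JSON) and attached to the snmp_object measurement, which is reported as an Object.
- When data with OID 1.3.6.1.4.1.232.9.2.2.32 is collected, a field named temperature is attached to the snmp_metric measurement, which is reported as a Metric.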
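The routing described above can be sketched in a few lines of Python. This is a simplified model, not Datakit's implementation: `PROFILE` is a hand-flattened stand-in for the YAML file, `apply_profile` is a hypothetical helper, and the sample values (`"FOC1234"`, `41`) are made up for illustration.

```python
import json

# Hand-flattened stand-in for cisco-3850.yaml (field name -> OID).
PROFILE = {
    "sysobjectid": "1.3.6.1.4.1.9.1.1745",
    "metadata": {"serial_number": "1.3.6.1.4.1.9.3.6.3.0"},
    "metrics": {"temperature": "1.3.6.1.4.1.232.9.2.2.32"},
}


def apply_profile(profile, sys_object_id, values):
    """Route collected OID values into object and metric fields.

    `values` maps OID -> value as returned by the device.
    Returns None when the device's sysobjectid does not match the profile.
    """
    if sys_object_id != profile["sysobjectid"]:
        return None

    # Metadata symbols end up inside the JSON-encoded device_meta field.
    meta = {name: values[oid]
            for name, oid in profile["metadata"].items() if oid in values}
    # Metric symbols become plain fields of the snmp_metric measurement.
    metric_fields = {name: values[oid]
                     for name, oid in profile["metrics"].items() if oid in values}

    return {
        "snmp_object": {"device_meta": json.dumps(meta, sort_keys=True)},
        "snmp_metric": metric_fields,
    }


if __name__ == "__main__":
    sample = {"1.3.6.1.4.1.9.3.6.3.0": "FOC1234", "1.3.6.1.4.1.232.9.2.2.32": 41}
    print(apply_profile(PROFILE, "1.3.6.1.4.1.9.1.1745", sample))
```

A device with a different sysobjectid yields `None` here, which mirrors the rule that the file applies only to devices with the same sysobjectid.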
Attention
The conf.d/snmp/profiles folder only appears after the SNMP collector has run once.
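Metrics¶
All data collected below gets a tag named host appended by default (its value is the name of the SNMP device). You can also specify other tags in the configuration through [inputs.snmp.tags]: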
Attention
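The measurements and metrics below list only some common fields. Depending on the configuration and the device model, additional device-specific fields may appear.
snmp_metric¶
SNMP device metric data.
- Tags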
Tag | Description |
---|---|
cpu | CPU index. Optional. |
device_vendor | Device vendor. |
entity_name | Device entity name. Optional. |
host | Device host, replace with IP. |
interface | Device interface. Optional. |
interface_alias | Device interface alias. Optional. |
ip | Device IP. |
mac_addr | Device MAC address. Optional. |
mem | Memory index. Optional. |
mem_pool_name | Memory pool name. Optional. |
name | Device name, replace with IP. |
power_source | Power source. Optional. |
power_status_descr | Power status description. Optional. |
sensor_id | Sensor ID. Optional. |
sensor_type | Sensor type. Optional. |
snmp_host | Device host. |
snmp_profile | Device SNMP profile file. |
temp_index | Temperature index. Optional. |
temp_state | Temperature state. Optional. |
- Fields
Metric | Description | Type | Unit |
---|---|---|---|
cieIfInputQueueDrops | [Cisco only] (Shown as packet) The number of input packets dropped. | float | count |
cieIfLastInTime | [Cisco only] (Shown as millisecond) The elapsed time in milliseconds since the last protocol input packet was received. | float | ms |
cieIfLastOutTime | [Cisco only] (Shown as millisecond) The elapsed time in milliseconds since the last protocol output packet was transmitted. | float | ms |
cieIfOutputQueueDrops | [Cisco only] (Shown as packet) The number of output packets dropped by the interface even though no error was detected to prevent them being transmitted. | float | count |
cieIfResetCount | [Cisco only] The number of times the interface was internally reset and brought up. | float | count |
ciscoEnvMonFanState | [Cisco only] The current state of the fan being instrumented. | float | count |
ciscoEnvMonSupplyState | [Cisco only] The current state of the power supply being instrumented. | float | count |
ciscoEnvMonTemperatureStatusValue | [Cisco only] The current value of the test point being instrumented. | float | count |
ciscoMemoryPoolFree | [Cisco only] Indicates the number of bytes from the memory pool that are currently unused on the managed device. | float | count |
ciscoMemoryPoolLargestFree | [Cisco only] Indicates the largest number of contiguous bytes from the memory pool that are currently unused on the managed device. | float | count |
ciscoMemoryPoolUsed | [Cisco only] Indicates the number of bytes from the memory pool that are currently in use by applications on the managed device. | float | count |
cpmCPUTotal1minRev | [Cisco only] (Shown as percent) The overall CPU busy percentage in the last 1 minute period. | float | percent |
cpmCPUTotalMonIntervalValue | [Cisco only] (Shown as percent) The overall CPU busy percentage in the last cpmCPUMonInterval period. | float | percent |
cpuUsage | (Shown as percent) Percentage of CPU currently being used. | float | percent |
cswStackPortOperStatus | [Cisco only] The state of the stack port. | float | count |
cswSwitchState | [Cisco only] The current state of a switch. | float | count |
entSensorValue | [Cisco only] The most recent measurement seen by the sensor. | float | count |
ifAdminStatus | The desired state of the interface. | float | - |
ifBandwidthInUsageRate | (Shown as percent) The percent rate of used received bandwidth. | float | percent |
ifBandwidthOutUsageRate | (Shown as percent) The percent rate of used sent bandwidth. | float | percent |
ifHCInBroadcastPkts | (Shown as packet) The number of packets delivered by this sub-layer to a higher (sub-)layer that were addressed to a broadcast address at this sub-layer. | float | count |
ifHCInMulticastPkts | (Shown as packet) The number of packets delivered by this sub-layer to a higher (sub-)layer that were addressed to a multicast address at this sub-layer. | float | count |
ifHCInOctets | (Shown as byte) The total number of octets received on the interface including framing characters. | float | count |
ifHCInOctetsRate | (Shown as byte) The total number of octets received on the interface including framing characters. | float | - |
ifHCInUcastPkts | (Shown as packet) The number of packets delivered by this sub-layer to a higher (sub-)layer that were not addressed to a multicast or broadcast address at this sub-layer. | float | count |
ifHCOutBroadcastPkts | (Shown as packet) The total number of packets that higher-level protocols requested be transmitted that were addressed to a broadcast address at this sub-layer, including those that were discarded or not sent. | float | count |
ifHCOutMulticastPkts | (Shown as packet) The total number of packets that higher-level protocols requested be transmitted that were addressed to a multicast address at this sub-layer, including those that were discarded or not sent. | float | count |
ifHCOutOctets | (Shown as byte) The total number of octets transmitted out of the interface including framing characters. | float | count |
ifHCOutOctetsRate | (Shown as byte) The total number of octets transmitted out of the interface including framing characters. | float | count |
ifHCOutUcastPkts | (Shown as packet) The total number of packets that higher-level protocols requested be transmitted that were not addressed to a multicast or broadcast address at this sub-layer, including those that were discarded or not sent. | float | count |
ifHighSpeed | An estimate of the interface's current bandwidth in units of 1,000,000 bits per second, or the nominal bandwidth. | float | count |
ifInDiscards | (Shown as packet) The number of inbound packets chosen to be discarded even though no errors had been detected to prevent them being deliverable to a higher-layer protocol. | float | count |
ifInDiscardsRate | (Shown as packet) The number of inbound packets chosen to be discarded even though no errors had been detected to prevent them being deliverable to a higher-layer protocol. | float | count |
ifInErrors | (Shown as packet) The number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol. | float | count |
ifInErrorsRate | (Shown as packet) The number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol. | float | count |
ifNumber | Number of interfaces. | float | - |
ifOperStatus | The current operational state of the interface. | float | count |
ifOutDiscards | (Shown as packet) The number of outbound packets chosen to be discarded even though no errors had been detected to prevent them being transmitted. | float | count |
ifOutDiscardsRate | (Shown as packet) The number of outbound packets chosen to be discarded even though no errors had been detected to prevent them being transmitted. | float | count |
ifOutErrors | (Shown as packet) The number of outbound packets that could not be transmitted because of errors. | float | count |
ifOutErrorsRate | (Shown as packet) The number of outbound packets that could not be transmitted because of errors. | float | count |
ifSpeed | An estimate of the interface's current bandwidth in bits per second, or the nominal bandwidth. | float | count |
memoryFree | (Shown as percent) The percentage of memory not being used. | float | percent |
memoryUsage | (Shown as percent) The percentage of memory currently being used. | float | percent |
memoryUsed | (Shown as byte) Number of bytes of memory currently being used. | float | count |
sysUpTimeInstance | The time (in hundredths of a second) since the network management portion of the system was last re-initialized. | float | count |
tcpActiveOpens | The number of times that TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state. | float | count |
tcpAttemptFails | The number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, or to the LISTEN state from the SYN-RCVD state. | float | count |
tcpCurrEstab | The number of TCP connections for which the current state is either ESTABLISHED or CLOSE-WAIT. | float | - |
tcpEstabResets | The number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state. | float | count |
tcpInErrs | (Shown as segment) The total number of segments received in error (e.g., bad TCP checksums). | float | count |
tcpOutRsts | (Shown as segment) The number of TCP segments sent containing the RST flag. | float | count |
tcpPassiveOpens | (Shown as connection) The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state. | float | count |
tcpRetransSegs | (Shown as segment) The total number of segments retransmitted; that is, the number of TCP segments transmitted containing one or more previously transmitted octets. | float | count |
udpInErrors | (Shown as datagram) The number of received UDP datagrams that could not be delivered for reasons other than the lack of an application at the destination port. | float | count |
udpNoPorts | (Shown as datagram) The total number of received UDP datagrams for which there was no application at the destination port. | float | count |
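Several fields in this table are easier to read once converted. The sketch below shows one plausible way to derive a per-second rate from two samples of a counter such as ifHCInOctets, a bandwidth usage percentage from that rate and ifHighSpeed, and a duration from sysUpTimeInstance. The formulas are assumptions for illustration; the source does not state how Datakit itself computes the `*Rate` and `ifBandwidth*UsageRate` fields, and all function names here are hypothetical.

```python
from datetime import timedelta


def counter_rate(prev, curr, seconds, bits=64):
    """Per-second rate between two samples of a monotonically increasing counter.

    Handles a single wrap-around of a `bits`-wide counter
    (the ifHC* counters are 64-bit).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = curr - prev
    if delta < 0:
        delta += 2 ** bits  # the counter wrapped between the two samples
    return delta / seconds


def bandwidth_usage_percent(octets_per_second, if_high_speed_mbps):
    """Share of the link in use, given ifHighSpeed in units of 1,000,000 bit/s."""
    if if_high_speed_mbps <= 0:
        return None  # speed unknown, no meaningful percentage
    return octets_per_second * 8 * 100 / (if_high_speed_mbps * 1_000_000)


def uptime(sys_uptime_ticks):
    """Convert sysUpTimeInstance (hundredths of a second) to a timedelta."""
    return timedelta(seconds=sys_uptime_ticks / 100)


if __name__ == "__main__":
    rate = counter_rate(prev=1_000, curr=4_000, seconds=10)
    print(rate, bandwidth_usage_percent(rate, 1000), uptime(8_640_000))
```

Working from the difference between two samples, rather than the raw counter, keeps the result meaningful after a device reboot or a counter wrap.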
Objects¶
snmp_object¶
SNMP device object data.
- Tags
Tag | Description |
---|---|
device_vendor | Device vendor. |
host | Device host, replace with IP. |
ip | Device IP. |
name | Device name, replace with IP. |
snmp_host | Device host. |
snmp_profile | Device SNMP profile file. |
- Fields
Metric | Description | Type | Unit |
---|---|---|---|
all | Device all data (JSON format). | string | - |
cpus | Device CPUs (JSON format). | string | - |
device_meta | Device meta data (JSON format). | string | - |
interfaces | Device network interfaces (JSON format). | string | - |
mem_pool_names | Device memory pool names (JSON format). | string | - |
mems | Device memories (JSON format). | string | - |
sensors | Device sensors (JSON format). | string | - |
FAQ¶
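How does Datakit discover devices?¶
Datakit supports two modes, "specific device" and "autodiscovery". Both can be enabled at the same time.
In specific-device mode, Datakit talks to the device at each specified IP over SNMP and can tell whether it is currently online.
In autodiscovery mode, Datakit sends SNMP packets to every address in the specified IP subnet, one by one. If a response matches a Profile, Datakit considers that an SNMP device exists at that IP.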
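To get a feel for how many addresses an autodiscovery subnet covers, the sweep can be modelled with Python's standard `ipaddress` module. This is a sketch, not Datakit's scanner: `discovery_targets` is a hypothetical helper, its parameters merely mirror the `auto_discovery` and `discovery_ignored_ip` settings, and skipping the network and broadcast addresses is an assumption of this sketch.

```python
import ipaddress


def discovery_targets(subnets, ignored=()):
    """Yield every host address in the given CIDR subnets, minus ignored IPs.

    `subnets` mirrors auto_discovery; `ignored` mirrors discovery_ignored_ip.
    """
    skip = {ipaddress.ip_address(ip) for ip in ignored}
    for cidr in subnets:
        # strict=False accepts values like "10.200.10.5/24" with host bits set.
        network = ipaddress.ip_network(cidr, strict=False)
        for host in network.hosts():
            if host not in skip:
                yield str(host)


if __name__ == "__main__":
    targets = list(discovery_targets(["10.200.10.0/24"], ["10.200.10.220"]))
    print(len(targets), targets[:3])
```

A /24 already means a few hundred probes per discovery round, which is why the workers and discovery_interval settings matter on larger networks.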
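What if I cannot find the metric I want on 观测云 (Guance)?¶
Datakit can collect generic baseline metrics from all SNMP devices. If the data reported by a collected device lacks a metric you want, you may need to define a custom Profile for that device.
To do this, you will most likely need to download the OID manual for that device model from the vendor's website.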
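Why do I see no metrics after enabling SNMP device collection?¶
Try opening up the ACL/firewall rules for your device.
On the host running Datakit, you can run snmpwalk -O bentU -v 2c -c <COMMUNITY_STRING> <IP_ADDRESS>:<PORT> 1.3.6. If it times out without any response, something is most likely blocking Datakit from collecting metrics from your device.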