GPU
采集包括 GPU 温度、时钟、GPU 占用率、内存占用率、GPU 内每个运行程序的内存占用等。
配置¶
安装 驱动及 CUDA 工具包¶
参考网址 https://www.nvidia.com/Download/index.aspx
采集器配置¶
进入 DataKit 安装目录下的 conf.d/gpu_smi
目录,复制 gpu_smi.conf.sample
并命名为 gpu_smi.conf
。示例如下:
[[inputs.gpu_smi]]
##(Optional) Collect interval, default is 10 seconds
interval = "10s"
##The binPath of gpu-smi
##If nvidia GPU
#(Example & default) bin_paths = ["/usr/bin/nvidia-smi"]
#(Example windows) bin_paths = ["nvidia-smi"]
##If lluvatar GPU
#(Example) bin_paths = ["/usr/local/corex/bin/ixsmi"]
#(Example) envs = [ "LD_LIBRARY_PATH=/usr/local/corex/lib/:$LD_LIBRARY_PATH" ]
##(Optional) Exec gpu-smi envs, default is []
#envs = [ "LD_LIBRARY_PATH=/usr/local/corex/lib/:$LD_LIBRARY_PATH" ]
##If remote GPU servers collected
##If use remote GPU servers, election must be true
##If use remote GPU servers, bin_paths should be shielded
#(Example) remote_addrs = ["192.168.1.1:22"]
#(Example) remote_users = ["remote_login_name"]
##If use remote_rsa_path, remote_passwords should be shielded
#(Example) remote_passwords = ["remote_login_password"]
#(Example) remote_rsa_paths = ["/home/your_name/.ssh/id_rsa"]
#(Example) remote_command = "nvidia-smi -x -q"
##(Optional) Exec gpu-smi timeout, default is 5 seconds
timeout = "5s"
##(Optional) Feed how much log data for ProcessInfos, default is 10. (0: 0 ,-1: all)
process_info_max_len = 10
##(Optional) GPU drop card warning delay, default is 300 seconds
gpu_drop_warning_delay = "300s"
## Set true to enable election
election = false
[inputs.gpu_smi.tags]
# some_tag = "some_value"
# more_tag = "some_other_value"
Attention
- Datakit 可以通过 SSH 远程采集 GPU 服务器的指标(开启远程采集后,本地采集配置将失效)。
remote_addrs
配置的个数可以多于remote_users
remote_passwords
remote_rsa_paths
个数,不够的匹配排位第一的数值。- 可以通过
remote_addrs
+remote_users
+remote_passwords
采集。 - 也可以通过
remote_addrs
+remote_users
+remote_rsa_paths
采集。(配置 RSA 公钥后,remote_passwords
将失效)。 - 开启远程采集后,必须开启选举。(防止多个 Datakit 上传重复数据)。
- 出于安全考虑,可以变更 SSH 端口号,也可以单独为 GPU 远程采集创建专用的账户。
配置好后,重启 DataKit 即可。
可通过 ConfigMap 方式注入采集器配置 或 配置 ENV_DATAKIT_INPUTS 开启采集器。
也支持以环境变量的方式修改配置参数(需要在 ENV_DEFAULT_ENABLED_INPUTS 中加为默认采集器):
-
ENV_INPUT_GPUSMI_INTERVAL
采集器重复间隔时长
字段类型: Duration
采集器配置字段:
interval
默认值: 10s
-
ENV_INPUT_GPUSMI_TIMEOUT
超时时长
字段类型: Duration
采集器配置字段:
timeout
默认值: 5s
-
ENV_INPUT_GPUSMI_BIN_PATH
执行文件路径
字段类型: JSON
采集器配置字段:
bin_path
示例:
["/usr/bin/nvidia-smi"]
-
ENV_INPUT_GPUSMI_PROCESS_INFO_MAX_LEN
最大收集最耗资源 GPU 进程数
字段类型: Int
采集器配置字段:
process_info_max_len
默认值: 10
-
ENV_INPUT_GPUSMI_DROP_WARNING_DELAY
掉卡告警延迟
字段类型: Duration
采集器配置字段:
gpu_drop_warning_delay
默认值: 5m
-
ENV_INPUT_GPUSMI_ENVS
执行依赖库的路径
字段类型: JSON
采集器配置字段:
envs
示例: ["LD_LIBRARY_PATH=/usr/local/corex/lib/:$LD_LIBRARY_PATH"]
-
ENV_INPUT_GPUSMI_REMOTE_ADDRS
远程 GPU 服务器
字段类型: JSON
采集器配置字段:
remote_addrs
示例: ["192.168.1.1:22","192.168.1.2:22"]
-
ENV_INPUT_GPUSMI_REMOTE_USERS
远程登录名
字段类型: JSON
采集器配置字段:
remote_users
示例: ["user_1","user_2"]
-
ENV_INPUT_GPUSMI_REMOTE_PASSWORDS
远程登录密码
字段类型: JSON
采集器配置字段:
remote_passwords
示例: ["pass_1","pass_2"]
-
ENV_INPUT_GPUSMI_REMOTE_RSA_PATHS
秘钥文件路径
字段类型: JSON
采集器配置字段:
remote_rsa_paths
示例: ["/home/your_name/.ssh/id_rsa"]
-
ENV_INPUT_GPUSMI_REMOTE_COMMAND
远程执行指令
字段类型: String
采集器配置字段:
remote_command
示例: "
nvidia-smi -x -q
" -
ENV_INPUT_GPUSMI_ELECTION
开启选举
字段类型: Boolean
采集器配置字段:
election
默认值: true
-
ENV_INPUT_GPUSMI_TAGS
自定义标签。如果配置文件有同名标签,将会覆盖它
字段类型: Map
采集器配置字段:
tags
示例: tag1=value1,tag2=value2
指标字段¶
以下所有数据采集,默认会追加全局选举 tag,也可以在配置中通过 [inputs.gpu_smi.tags]
指定其它标签:
gpu_smi
¶
- 标签
Tag | Description |
---|---|
compute_mode |
Compute mode |
cuda_version |
CUDA version |
driver_version |
Driver version |
host |
Host name |
name |
GPU card model |
pci_bus_id |
PCI bus id |
pstate |
GPU performance level |
uuid |
UUID |
- 指标列表
Metric | Description | Type | Unit |
---|---|---|---|
clocks_current_graphics |
Graphics clock frequency. | int | MHz |
clocks_current_memory |
Memory clock frequency. | int | MHz |
clocks_current_sm |
Streaming Multiprocessor clock frequency. | int | MHz |
clocks_current_video |
Video clock frequency. | int | MHz |
encoder_stats_average_fps |
Encoder average fps. | int | - |
encoder_stats_average_latency |
Encoder average latency. | int | - |
encoder_stats_session_count |
Encoder session count. | int | count |
fan_speed |
Fan speed. | int | RPM% |
fbc_stats_average_fps |
Frame Buffer Cache average fps. | int | - |
fbc_stats_average_latency |
Frame Buffer Cache average latency. | int | - |
fbc_stats_session_count |
Frame Buffer Cache session count. | int | - |
memory_total |
Frame buffer memory total. | int | MB |
memory_used |
Frame buffer memory used. | int | MB |
pcie_link_gen_current |
PCI-Express link gen. | int | - |
pcie_link_width_current |
PCI link width. | int | - |
power_draw |
Power draw. | float | watt |
temperature_gpu |
GPU temperature. | int | C |
utilization_decoder |
Decoder utilization. | int | percent |
utilization_encoder |
Encoder utilization. | int | percent |
utilization_gpu |
GPU utilization. | int | percent |
utilization_memory |
Memory utilization. | int | percent |
DCGM 指标采集¶
- 操作系统支持:
DCGM 指标包括 GPU 卡温度、时钟、GPU 占用率、内存占用率等。
DCGM 配置¶
DCGM 指标前置条件¶
安装 dcgm-exporter
,参考NVIDIA 官网
DCGM 采集配置¶
进入 DataKit 安装目录下的 conf.d/prom
目录,复制 prom.conf.sample
并命名为 prom.conf
。示例如下:
[[inputs.prom]]
## Exporter URLs
urls = ["http://127.0.0.1:9400/metrics"]
## 忽略对 URL 的请求错误
ignore_req_err = false
## 采集器别名
source = "dcgm"
## 采集数据输出源
## 配置此项,可以将采集到的数据写到本地文件而不将数据打到中心
## 之后可以直接用 datakit debug --prom-conf /path/to/this/conf 命令对本地保存的指标集进行调试
## 如果已经将 URL 配置为本地文件路径,则 --prom-conf 优先调试 output 路径的数据
# output = "/abs/path/to/file"
## 采集数据大小上限,单位为字节
## 将数据输出到本地文件时,可以设置采集数据大小上限
## 如果采集数据的大小超过了此上限,则采集的数据将被丢弃
## 采集数据大小上限默认设置为 32MB
# max_file_size = 0
## 指标类型过滤,可选值为 counter/gauge/histogram/summary/untyped
## 默认只采集 counter 和 gauge 类型的指标
## 如果为空,则不进行过滤
# metric_types = ["counter", "gauge"]
## 指标名称筛选:符合条件的指标将被保留下来
## 支持正则,可以配置多个,即满足其中之一即可
## 如果为空,则不进行筛选,所有指标均保留
# metric_name_filter = ["cpu"]
## 指标集名称前缀
## 配置此项,可以给指标集名称添加前缀
measurement_prefix = "gpu_"
## 指标集名称
## 默认会将指标名称以下划线 "_" 进行切割,切割后的第一个字段作为指标集名称,剩下字段作为当前指标名称
## 如果配置 measurement_name, 则不进行指标名称的切割
## 最终的指标集名称会添加上 measurement_prefix 前缀
measurement_name = "dcgm"
## TLS 配置
# tls_open = false
# tls_ca = "/tmp/ca.crt"
# tls_cert = "/tmp/peer.crt"
# tls_key = "/tmp/peer.key"
## 设置为 true 以开启选举功能
election = true
## 过滤 tags, 可配置多个 tag
## 匹配的 tag 将被忽略,但对应的数据仍然会上报上来
# tags_ignore = ["xxxx"]
## 自定义认证方式,目前仅支持 Bearer Token
## token 和 token_file: 仅需配置其中一项即可
# [inputs.prom.auth]
# type = "bearer_token"
# token = "xxxxxxxx"
# token_file = "/tmp/token"
## 自定义指标集名称
## 可以将包含前缀 prefix 的指标归为一类指标集
## 自定义指标集名称配置优先 measurement_name 配置项
# [[inputs.prom.measurements]]
# prefix = "cpu_"
# name = "cpu"
# [[inputs.prom.measurements]]
# prefix = "mem_"
# name = "mem"
## 对于匹配如下 tag 相关的数据,丢弃这些数据不予采集
# [inputs.prom.ignore_tag_kv_match]
# key1 = [ "val1.*", "val2.*"]
# key2 = [ "val1.*", "val2.*"]
## 在数据拉取的 HTTP 请求中添加额外的请求头(例如 Basic 认证)
# [inputs.prom.http_headers]
# Authorization = “Basic bXl0b21jYXQ="
## 重命名 prom 数据中的 tag key
[inputs.prom.tags_rename]
overwrite_exist_tags = false
[inputs.prom.tags_rename.mapping]
Hostname = "host"
# tag1 = "new-name-1"
# tag2 = "new-name-2"
## 将采集到的指标作为日志打到中心
## service 字段留空时,会把 service tag 设为指标集名称
[inputs.prom.as_logging]
enable = false
service = "service_name"
## 自定义 Tags
# [inputs.prom.tags]
# some_tag = "some_value"
# more_tag = "some_other_value"
配置好后,重启 DataKit 即可。
DCGM 指标¶
gpu_dcgm
¶
- 标签
Tag | Description |
---|---|
gpu | GPU id. |
device | device. |
modelName | GPU model. |
Hostname | host name. |
host | Instance endpoint. |
UUID | UUID. |
DCGM_FI_NVML_VERSION | NVML Version. |
DCGM_FI_DEV_BRAND | Device Brand. |
DCGM_FI_DEV_SERIAL | Device Serial Number. |
DCGM_FI_DEV_OEM_INFOROM_VER | OEM inforom version. |
DCGM_FI_DEV_ECC_INFOROM_VER | ECC inforom version. |
DCGM_FI_DEV_POWER_INFOROM_VER | Power management object inforom version. |
DCGM_FI_DEV_INFOROM_IMAGE_VER | Inforom image version. |
DCGM_FI_DEV_VBIOS_VERSION | VBIOS version of the device. |
- 指标列表
Metric | Unit | Description |
---|---|---|
DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz). |
DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz). |
DCGM_FI_DEV_MEMORY_TEMP | gauge | Memory temperature (in C). |
DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C). |
DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W). |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | Total energy consumption since boot (in mJ). |
DCGM_FI_DEV_PCIE_TX_THROUGHPUT | counter | Total number of bytes transmitted through PCIe TX (in KB) via NVML . |
DCGM_FI_DEV_PCIE_RX_THROUGHPUT | counter | Total number of bytes received through PCIe RX (in KB) via NVML . |
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | Total number of PCIe retries. |
DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %). |
DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %). |
DCGM_FI_DEV_ENC_UTIL | gauge | Encoder utilization (in %). |
DCGM_FI_DEV_DEC_UTIL | gauge | Decoder utilization (in %). |
DCGM_FI_DEV_XID_ERRORS | gauge | Value of the last XID error encountered. |
DCGM_FI_DEV_POWER_VIOLATION | counter | Throttling duration due to power constraints (in us). |
DCGM_FI_DEV_THERMAL_VIOLATION | counter | Throttling duration due to thermal constraints (in us). |
DCGM_FI_DEV_SYNC_BOOST_VIOLATION | counter | Throttling duration due to sync-boost constraints (in us). |
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | counter | Throttling duration due to board limit constraints (in us). |
DCGM_FI_DEV_LOW_UTIL_VIOLATION | counter | Throttling duration due to low utilization (in us). |
DCGM_FI_DEV_RELIABILITY_VIOLATION | counter | Throttling duration due to reliability constraints (in us). |
DCGM_FI_DEV_FB_FREE | gauge | Framebuffer memory free (in MiB). |
DCGM_FI_DEV_FB_USED | gauge | Framebuffer memory used (in MiB). |
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | counter | Total number of single-bit volatile ECC errors. |
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | counter | Total number of double-bit volatile ECC errors. |
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL | counter | Total number of single-bit persistent ECC errors. |
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL | counter | Total number of double-bit persistent ECC errors. |
DCGM_FI_DEV_RETIRED_SBE | counter | Total number of retired pages due to single-bit errors. |
DCGM_FI_DEV_RETIRED_DBE | counter | Total number of retired pages due to double-bit errors. |
DCGM_FI_DEV_RETIRED_PENDING | counter | Total number of pages pending retirement. |
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL | counter | Total number of NVLink flow-control CRC errors. |
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL | counter | Total number of NVLink data CRC errors. |
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL | counter | Total number of NVLink retries. |
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL | counter | Total number of NVLink recovery errors. |
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | counter | Total number of NVLink bandwidth counters for all lanes. |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 | counter | The number of bytes of active NVLink rx or tx data including both header and payload. |
DCGM_FI_DEV_VGPU_LICENSE_STATUS | gauge | vGPU License status. |
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | counter | Number of remapped rows for uncorrectable errors. |
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | counter | Number of remapped rows for correctable errors. |
DCGM_FI_DEV_ROW_REMAP_FAILURE | gauge | Whether remapping of rows has failed. |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | gauge | Ratio of time the graphics engine is active (in %). |
DCGM_FI_PROF_SM_ACTIVE | gauge | The ratio of cycles an SM has at least 1 warp assigned (in %). |
DCGM_FI_PROF_SM_OCCUPANCY | gauge | The ratio of number of warps resident on an SM (in %). |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | gauge | Ratio of cycles the tensor (HMMA ) pipe is active (in %). |
DCGM_FI_PROF_DRAM_ACTIVE | gauge | Ratio of cycles the device memory interface is active sending or receiving data (in %). |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | gauge | Ratio of cycles the fp64 pipes are active (in %). |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | gauge | Ratio of cycles the fp32 pipes are active (in %). |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | gauge | Ratio of cycles the fp16 pipes are active (in %). |
DCGM_FI_PROF_PCIE_TX_BYTES | gauge | The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per .second. |
DCGM_FI_PROF_PCIE_RX_BYTES | gauge | The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per .second. |
DCGM_FI_DRIVER_VERSION | label | Driver Version. |