GPU¶
SMI metric display: including GPU card temperature, clock, GPU occupancy rate, memory occupancy rate, memory occupancy of each running program in GPU, etc.
Configuration¶
Install Driver and CUDA Kit¶
See https://www.nvidia.com/Download/index.aspx
Collector Configuration¶
Go to the conf.d/gpu_smi
directory under the DataKit installation directory, copy gpu_smi.conf.sample
and name it gpu_smi.conf
. Examples are as follows:
[[inputs.gpu_smi]]
##(Optional) Collect interval, default is 10 seconds
interval = "10s"
##The binPath of gpu-smi
##If nvidia GPU
#(Example & default) bin_paths = ["/usr/bin/nvidia-smi"]
#(Example windows) bin_paths = ["nvidia-smi"]
##If lluvatar GPU
#(Example) bin_paths = ["/usr/local/corex/bin/ixsmi"]
#(Example) envs = [ "LD_LIBRARY_PATH=/usr/local/corex/lib/:$LD_LIBRARY_PATH" ]
##(Optional) Exec gpu-smi envs, default is []
#envs = [ "LD_LIBRARY_PATH=/usr/local/corex/lib/:$LD_LIBRARY_PATH" ]
##If remote GPU servers collected
##If use remote GPU servers, election must be true
##If use remote GPU servers, bin_paths should be shielded
#(Example) remote_addrs = ["192.168.1.1:22"]
#(Example) remote_users = ["remote_login_name"]
##If use remote_rsa_path, remote_passwords should be shielded
#(Example) remote_passwords = ["remote_login_password"]
#(Example) remote_rsa_paths = ["/home/your_name/.ssh/id_rsa"]
#(Example) remote_command = "nvidia-smi -x -q"
##(Optional) Exec gpu-smi timeout, default is 5 seconds
timeout = "5s"
##(Optional) Feed how much log data for ProcessInfos, default is 10. (0: 0 ,-1: all)
process_info_max_len = 10
##(Optional) GPU drop card warning delay, default is 300 seconds
gpu_drop_warning_delay = "300s"
## Set true to enable election
election = false
[inputs.gpu_smi.tags]
# some_tag = "some_value"
# more_tag = "some_other_value"
Attention
- DataKit can remotely collect GPU server indicators through SSH (when remote collection is enabled, the local configuration will be invalid).
- The number of
remote_addrs
configured can be more than the number ofremote_users
remote_passwords
remote_rsa_paths
.If not enough, it will match the first value. - Can be collected through
remote_addrs
+remote_users
+remote_passwords
. - It can also be collected through
remote_addrs
+remote_users
+remote_rsa_paths
. (remote_passwords
will be invalid after configuring the RSA public key). - After turning on remote collection, elections must be turned on. (Prevent multiple DataKit from uploading duplicate data).
- For security reasons, you can change the SSH port number or create a dedicated account for GPU remote collection.
After configuration, restart DataKit.
Can be turned on by ConfigMap Injection Collector Configuration or Config ENV_DATAKIT_INPUTS .
Can also be turned on by environment variables, (needs to be added as the default collector in ENV_DEFAULT_ENABLED_INPUTS):
-
ENV_INPUT_GPUSMI_INTERVAL
Collect interval
Type: TimeDuration
ConfField:
interval
Default: 10s
-
ENV_INPUT_GPUSMI_TIMEOUT
Timeout
Type: TimeDuration
ConfField:
timeout
Default: 5s
-
ENV_INPUT_GPUSMI_BIN_PATH
The binPath
Type: JSON
ConfField:
bin_path
Example:
["/usr/bin/nvidia-smi"]
-
ENV_INPUT_GPUSMI_PROCESS_INFO_MAX_LEN
Maximum number of GPU processes that consume the most resources
Type: Int
ConfField:
process_info_max_len
Default: 10
-
ENV_INPUT_GPUSMI_DROP_WARNING_DELAY
GPU card drop warning delay
Type: TimeDuration
ConfField:
gpu_drop_warning_delay
Default: 5m
-
ENV_INPUT_GPUSMI_ENVS
The envs of LD_LIBRARY_PATH
Type: JSON
ConfField:
envs
Example: ["LD_LIBRARY_PATH=/usr/local/corex/lib/:$LD_LIBRARY_PATH"]
-
ENV_INPUT_GPUSMI_REMOTE_ADDRS
If use remote GPU servers
Type: JSON
ConfField:
remote_addrs
Example: ["192.168.1.1:22","192.168.1.2:22"]
-
ENV_INPUT_GPUSMI_REMOTE_USERS
Remote login name
Type: JSON
ConfField:
remote_users
Example: ["user_1","user_2"]
-
ENV_INPUT_GPUSMI_REMOTE_PASSWORDS
Remote password
Type: JSON
ConfField:
remote_passwords
Example: ["pass_1","pass_2"]
-
ENV_INPUT_GPUSMI_REMOTE_RSA_PATHS
Remote rsa paths
Type: JSON
ConfField:
remote_rsa_paths
Example: ["/home/your_name/.ssh/id_rsa"]
-
ENV_INPUT_GPUSMI_REMOTE_COMMAND
Remote command
Type: String
ConfField:
remote_command
Example: "
nvidia-smi -x -q
" -
ENV_INPUT_GPUSMI_ELECTION
Enable election
Type: Boolean
ConfField:
election
Default: true
-
ENV_INPUT_GPUSMI_TAGS
Customize tags. If there is a tag with the same name in the configuration file, it will be overwritten
Type: Map
ConfField:
tags
Example: tag1=value1,tag2=value2
Metric¶
For all of the following data collections, a global tag named host
is appended by default (the tag value is the host name of the DataKit), or other tags can be specified in the configuration by [inputs.gpu_smi.tags]
:
gpu_smi
¶
- Tags
Tag | Description |
---|---|
compute_mode |
Compute mode |
cuda_version |
CUDA version |
driver_version |
Driver version |
host |
Host name |
name |
GPU card model |
pci_bus_id |
PCI bus id |
pstate |
GPU performance level |
uuid |
UUID |
- Metrics
Metric | Description | Type | Unit |
---|---|---|---|
clocks_current_graphics |
Graphics clock frequency. | int | MHz |
clocks_current_memory |
Memory clock frequency. | int | MHz |
clocks_current_sm |
Streaming Multiprocessor clock frequency. | int | MHz |
clocks_current_video |
Video clock frequency. | int | MHz |
encoder_stats_average_fps |
Encoder average fps. | int | - |
encoder_stats_average_latency |
Encoder average latency. | int | - |
encoder_stats_session_count |
Encoder session count. | int | count |
fan_speed |
Fan speed. | int | RPM% |
fbc_stats_average_fps |
Frame Buffer Cache average fps. | int | - |
fbc_stats_average_latency |
Frame Buffer Cache average latency. | int | - |
fbc_stats_session_count |
Frame Buffer Cache session count. | int | - |
memory_total |
Frame buffer memory total. | int | MB |
memory_used |
Frame buffer memory used. | int | MB |
pcie_link_gen_current |
PCI-Express link gen. | int | - |
pcie_link_width_current |
PCI link width. | int | - |
power_draw |
Power draw. | float | watt |
temperature_gpu |
GPU temperature. | int | C |
utilization_decoder |
Decoder utilization. | int | percent |
utilization_encoder |
Encoder utilization. | int | percent |
utilization_gpu |
GPU utilization. | int | percent |
utilization_memory |
Memory utilization. | int | percent |
DCGM Metrics Collection¶
- Operating system support:
DCGM indicator display: including GPU card temperature, clock, GPU occupancy rate, memory occupancy rate, etc.
DCGM Configuration¶
DCGM Metrics Preconditions¶
Install dcgm-exporter
, refer to here
DCGM Metrics Configuration¶
Go to the conf.d/Prom
directory under the DataKit installation directory, copy prom.conf.sample
and name it prom.conf
. Examples are as follows:
# {"version": "1.4.11-13-gd70f1f8ff7", "desc": "do NOT edit this line"}
[[inputs.prom]]
# Exporter URLs
# urls = ["http://127.0.0.1:9100/metrics", "http://127.0.0.1:9200/metrics"]
urls = ["http://127.0.0.1:9400/metrics"]
# Error ignoring request to url
ignore_req_err = false
# Collector alias
source = "prom"
# Collection data output source
# Configure this to write collected data to a local file instead of typing the data to the center
# You can debug the locally saved metric set directly with the datakit debug --prom-conf /path/to/this/conf command
# If url has been configured as the local file path, then --prom-conf takes precedence over debugging the data in the output path
# output = "/abs/path/to/file"
# Maximum size of data collected in bytes
# When outputting data to a local file, you can set the upper limit of the size of the collected data
# If the size of the collected data exceeds this limit, the collected data will be discarded
# The maximum size of collected data is set to 32MB by default
# max_file_size = 0
# Metrics type filtering, optional values are counter, gauge, histogram, summary and untyped
# Only counter and gauge metrics are collected by default
# If empty, no filtering is performed
metric_types = ["counter", "gauge"]
# Metric Name Filter: Eligible metrics will be retained
# Support regular can configure more than one, that is, satisfy one of them
# If blank, no filtering is performed and all metrics are retained
# metric_name_filter = ["cpu"]
# Measurement name prefix
# Configure this to prefix the measurement name
measurement_prefix = "gpu_"
# Measurement name
# By default, the measurement name will be cut with an underscore "_". The first field after cutting will be the measurement name, and the remaining fields will be the current metric name
# If measurement_name is configured, the metric name is not cut
# The final measurement name is prefixed with measurement_prefix
measurement_name = "dcgm"
# TLS configuration
tls_open = false
# tls_ca = "/tmp/ca.crt"
# tls_cert = "/tmp/peer.crt"
# tls_key = "/tmp/peer.key"
## Set to true to turn on election
election = true
# Filter tags, configurable multiple tags
# Matching tags will be ignored, but the corresponding data will still be reported
# tags_ignore = ["xxxx"]
#tags_ignore = ["host"]
# Custom authentication method, currently only supports Bearer Token
# token and token_file: Just configure one of them
# [inputs.prom.auth]
# type = "bearer_token"
# token = "xxxxxxxx"
# token_file = "/tmp/token"
# Custom measurement name
# You can group metrics that contain the prefix prefix into one measurement
# Custom measurement name configuration priority measurement_name Configuration Items
#[[inputs.prom.measurements]]
# prefix = "cpu_"
# name = "cpu"
# [[inputs.prom.measurements]]
# prefix = "mem_"
# name = "mem"
# For data that matches the following tag, discard the data and do not collect it
[inputs.prom.ignore_tag_kv_match]
# key1 = [ "val1.*", "val2.*"]
# key2 = [ "val1.*", "val2.*"]
# Add additional request headers to HTTP requests for data fetches
[inputs.prom.http_headers]
# Root = "passwd"
# Michael = "1234"
# Rename tag key in prom data
[inputs.prom.tags_rename]
overwrite_exist_tags = false
[inputs.prom.tags_rename.mapping]
Hostname = "host"
# tag1 = "new-name-1"
# tag2 = "new-name-2"
# tag3 = "new-name-3"
# Call the collected metrics to the center as logs
# When the service field is left blank, the service tag is set to measurement name
[inputs.prom.as_logging]
enable = false
service = "service_name"
# Customize Tags
[inputs.prom.tags]
# some_tag = "some_value"
# more_tag = "some_other_value"
After configuration, restart DataKit.
DCGM Metrics¶
Metrics | Description | Data Type |
---|---|---|
DCGM_FI_DEV_DEC_UTIL | gauge, Decoder utilization (in %). | int |
DCGM_FI_DEV_ENC_UTIL | gauge, Encoder utilization (in %). | int |
DCGM_FI_DEV_FB_FREE | gauge, Frame buffer memory free (in MiB). | int |
DCGM_FI_DEV_FB_USED | gauge, Frame buffer memory used (in MiB). | int |
DCGM_FI_DEV_GPU_TEMP | gauge, GPU temperature (in C). | int |
DCGM_FI_DEV_GPU_UTIL | gauge, GPU utilization (in %). | int |
DCGM_FI_DEV_MEM_CLOCK | gauge, Memory clock frequency (in MHz). | int |
DCGM_FI_DEV_MEM_COPY_UTIL | gauge, Memory utilization (in %). | int |
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | counter, Total number of NVLink bandwidth counters for all lanes. | int |
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter, Total number of PCIe retries. | int |
DCGM_FI_DEV_SM_CLOCK | gauge, SM clock frequency (in MHz). | int |
DCGM_FI_DEV_VGPU_LICENSE_STATUS | gauge, vGPU License status | int |
DCGM_FI_DEV_XID_ERRORS | gauge, Value of the last XID error encountered. | int |