# Format
# If a line starts with a '#', it is considered a comment
# DCGM FIELD, Prometheus metric type, help message
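# Illustrative sketch of the format above, using an entry that also appears
# later in this file (field name, then metric type, then help message):
#   DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# The metric types used in this file are gauge, counter, and label.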
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
# Errors and violations
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Frame buffer memory free (in MB).
DCGM_FI_DEV_FB_USED, gauge, Frame buffer memory used (in MB).
# ECC
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
# Retired pages
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
# NVLink
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status.
# Remapped rows
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors.
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors.
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed.
# Static configuration information. These appear as labels on the other metrics
DCGM_FI_DRIVER_VERSION, label, Driver Version
# DCGM_FI_NVML_VERSION, label, NVML Version
# DCGM_FI_DEV_BRAND, label, Device Brand
# DCGM_FI_DEV_SERIAL, label, Device Serial Number
# DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
# DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
# DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device