1 | # Format |
---|---|
2 | # If line starts with a '#' it is considered a comment |
3 | # DCGM FIELD, Prometheus metric type, help message |
4 | # Clocks |
5 | DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz). |
6 | DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz). |
7 | # Temperature |
8 | DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C). |
9 | DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C). |
10 | # Power |
11 | DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W). |
12 | DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ). |
13 | # PCIE |
14 | DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total data transmitted through PCIe TX (in KB) via NVML. |
15 | DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total data received through PCIe RX (in KB) via NVML. |
16 | DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries. |
17 | # Utilization (the sample period varies depending on the product) |
18 | DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %). |
19 | DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %). |
20 | DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %). |
21 | DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %). |
22 | # Errors and violations |
23 | DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered. |
24 | # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us). |
25 | # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us). |
26 | # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us). |
27 | # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us). |
28 | # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us). |
29 | # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us). |
30 | # Memory usage |
31 | DCGM_FI_DEV_FB_FREE, gauge, Frame buffer memory free (in MB). |
32 | DCGM_FI_DEV_FB_USED, gauge, Frame buffer memory used (in MB). |
33 | # ECC |
34 | # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors. |
35 | # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors. |
36 | # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors. |
37 | # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors. |
38 | # Retired pages |
39 | # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors. |
40 | # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors. |
41 | # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement. |
42 | # NVLink |
43 | # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors. |
44 | # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors. |
45 | # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries. |
46 | # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors. |
47 | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes |
48 | # VGPU License status |
49 | DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status |
50 | # Remapped rows |
51 | DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors |
52 | DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors |
53 | DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed |
54 | # Static configuration information. These appear as labels on the other metrics |
55 | DCGM_FI_DRIVER_VERSION, label, Driver Version |
56 | # DCGM_FI_NVML_VERSION, label, NVML Version |
57 | # DCGM_FI_DEV_BRAND, label, Device Brand |
58 | # DCGM_FI_DEV_SERIAL, label, Device Serial Number |
59 | # DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version |
60 | # DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version |
61 | # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version |
62 | # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version |
63 | # DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device |
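
The format described in the header comments (a DCGM field, a Prometheus metric type, and a help message, with `#` marking a comment line) can be read with a few lines of Python. The following is a minimal sketch and not the exporter's own parser: the names `MetricSpec` and `parse_metrics_config` are hypothetical, and splitting on only the first two commas (so that a help message may itself contain commas) is an assumption about how the format should be interpreted.

```python
from typing import List, NamedTuple


class MetricSpec(NamedTuple):
    """One active (non-comment) row of the metrics file. Hypothetical helper type."""
    field: str        # DCGM field name, e.g. DCGM_FI_DEV_SM_CLOCK
    metric_type: str  # value of the second column, e.g. gauge, counter, label
    help_text: str    # free-form help message


def parse_metrics_config(text: str) -> List[MetricSpec]:
    """Parse 'FIELD, type, help message' rows, skipping blanks and '#' comments."""
    specs = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (including commented-out metrics)
        # Assumption: only the first two commas separate columns,
        # so the help message is allowed to contain commas.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(parts)}")
        specs.append(MetricSpec(*parts))
    return specs


# A few rows copied from the file above, including one commented-out metric.
SAMPLE = """\
# DCGM FIELD, Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
DCGM_FI_DRIVER_VERSION, label, Driver Version
"""

specs = parse_metrics_config(SAMPLE)
# Rows of type 'label' are described in the file as appearing as labels on the other metrics.
label_fields = [s.field for s in specs if s.metric_type == "label"]
metric_fields = [s.field for s in specs if s.metric_type != "label"]
```

Because every column is stripped of surrounding whitespace, a row with a stray space before the comma still yields a clean field name. Commented-out rows are ignored entirely, so enabling a metric amounts to removing the leading `#`.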