first commit

1 year ago · 24bdefa8e6
commit 24bdefa8e6
1 changed files with 171 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,171 @@
+# DCGM-Exporter
+
+This repository contains the DCGM-Exporter project. It provides a GPU metrics exporter for [Prometheus](https://prometheus.io/) that leverages [NVIDIA DCGM](https://developer.nvidia.com/dcgm).
+
+### Documentation
+
+Official documentation for DCGM-Exporter can be found on [docs.nvidia.com](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html).
+
+### Quickstart
+
+To gather metrics on a GPU node, simply start the `dcgm-exporter` container:
+```
+$ docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.7-ubuntu20.04
+$ curl localhost:9400/metrics
+# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
+# TYPE DCGM_FI_DEV_SM_CLOCK gauge
+# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
+# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
+# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
+# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
+...
+DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
+DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
+DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
+...
+```
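+
+The `DCGM_FI_DEV_MEMORY_TEMP` value in the sample output above is not a plausible temperature. It appears to fall in the range DCGM reserves for "blank" int64 values (no data, field not supported, and so on). The sketch below shows one way to drop such samples when post-processing the `/metrics` output. The threshold `0x7ffffffffffffff0` is an assumption taken from the DCGM headers, and the helper names `parse_samples` and `is_blank` are hypothetical, not part of `dcgm-exporter`.
+
+```python
+# Assumption: DCGM treats int64 values at or above this threshold as
+# "blank" sentinels rather than real readings.
+DCGM_INT64_BLANK = 0x7FFFFFFFFFFFFFF0
+
+
+def parse_samples(text):
+    """Yield (series, raw_value) pairs from Prometheus text-format output."""
+    for line in text.splitlines():
+        line = line.strip()
+        # Skip empty lines, "# HELP"/"# TYPE" comments, and "..." elisions.
+        if not line or line.startswith("#") or line == "...":
+            continue
+        # The value is the last whitespace-separated token; the label set
+        # may itself contain spaces, so split from the right.
+        series, raw = line.rsplit(None, 1)
+        yield series, raw
+
+
+def is_blank(raw):
+    """Return True if raw is an integer in the assumed blank range."""
+    try:
+        return int(raw) >= DCGM_INT64_BLANK
+    except ValueError:
+        return False  # non-integer values are treated as ordinary readings
+
+
+sample = """\
+DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
+DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
+"""
+
+valid = [(s, v) for s, v in parse_samples(sample) if not is_blank(v)]
+for series, value in valid:
+    print(series, value)
+```
+
+Python integers have arbitrary precision, so the comparison is exact. Tools that parse these values as 64-bit floats may not be able to distinguish the sentinel values from one another.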
+
+### Quickstart on Kubernetes
+
+Note: Consider using the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) rather than DCGM-Exporter directly.
+
+Ensure you have already set up your cluster with the [default runtime as NVIDIA](https://github.com/NVIDIA/nvidia-container-runtime#docker-engine-setup).
+
+The recommended way to install DCGM-Exporter is to use the Helm chart. First, add the Helm repo:
+```
+$ helm repo add gpu-helm-charts \
+  https://nvidia.github.io/dcgm-exporter/helm-charts
+```
+Update the repo:
+```
+$ helm repo update
+```
+And install the chart:
+```
+$ helm install \
+    --generate-name \
+    gpu-helm-charts/dcgm-exporter
+```
+
+Once the `dcgm-exporter` pod is deployed, you can use port forwarding to obtain metrics quickly:
+
+
+```
+$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
+
+# Let's get the output of a random pod:
+$ NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" \
+                         -o "jsonpath={ .items[0].metadata.name}")
+
+$ kubectl port-forward $NAME 8080:9400 &
+$ curl -sL http://127.0.0.1:8080/metrics
+# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
+# TYPE DCGM_FI_DEV_SM_CLOCK gauge
+# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
+# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
+# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
+# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
+...
+DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 139
+DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 405
+DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 9223372036854775794
+...
+
+```
+To integrate DCGM-Exporter with Prometheus and Grafana, see the full instructions in the [user guide](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#gpu-telemetry).
+`dcgm-exporter` is also deployed as part of the GPU Operator. To get started integrating it with Prometheus, see the Operator [user guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#gpu-telemetry).
+
+### Building from Source
+
+To build `dcgm-exporter`, ensure you have the following:
+- [Golang >= 1.14 installed](https://golang.org/)
+- [DCGM installed](https://developer.nvidia.com/dcgm)
+
+```
+$ git clone https://github.com/NVIDIA/dcgm-exporter.git
+$ cd dcgm-exporter
+$ make binary
+$ sudo make install
+...
+$ dcgm-exporter &
+$ curl localhost:9400/metrics
+# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
+# TYPE DCGM_FI_DEV_SM_CLOCK gauge
+# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
+# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
+# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
+# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
+...
+DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
+DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
+DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
+...
+```
+
+### Changing Metrics
+
+With `dcgm-exporter` you can configure which fields are collected by specifying a custom CSV file.
+You will find the default CSV file under `etc/default-counters.csv` in the repository. It is copied to `/etc/dcgm-exporter/default-counters.csv` on your system or container.
+
+The layout and format of this file are as follows:
+```
+# Format
+# If line starts with a '#' it is considered a comment
+# DCGM FIELD, Prometheus metric type, help message
+
+# Clocks
+DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
+DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
+```
+
+A custom CSV file can be specified using the `-f` or `--collectors` option as follows:
+```
+$ dcgm-exporter -f /tmp/custom-collectors.csv
+```
+
+Notes:
+- Always make sure each entry has exactly 2 commas (','), one between each of the three columns
+- The complete list of counters that can be collected can be found on the DCGM API reference manual: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
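+
+The two-comma rule above can be checked before the file is passed to `dcgm-exporter`. The following is a minimal sketch, not the exporter's own parser. The function name `check_counters` is hypothetical, and it only verifies the comma count described in the notes.
+
+```python
+def check_counters(text):
+    """Return a list of (line_number, message) problems found in text."""
+    problems = []
+    for number, line in enumerate(text.splitlines(), start=1):
+        stripped = line.strip()
+        # Blank lines and lines starting with '#' carry no entry.
+        if not stripped or stripped.startswith("#"):
+            continue
+        # Each entry is: DCGM field, Prometheus metric type, help message.
+        if stripped.count(",") != 2:
+            problems.append((number, "expected exactly 2 commas"))
+    return problems
+
+
+sample = """\
+# Clocks
+DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
+DCGM_FI_DEV_MEM_CLOCK, gauge
+"""
+
+for number, message in check_counters(sample):
+    print(f"line {number}: {message}")
+```
+
+To check a real file, read its contents and pass them in, for example `check_counters(open("/tmp/custom-collectors.csv").read())`.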
+
+### What about a Grafana Dashboard?
+
+You can find the official NVIDIA DCGM-Exporter dashboard here: https://grafana.com/grafana/dashboards/12239
+
+You will also find the `json` file in this repo under `grafana/dcgm-exporter-dashboard.json`.
+
+Pull requests are accepted!
+
+
+### Building the containers
+
+This project uses [docker buildx](https://docs.docker.com/buildx/working-with-buildx/) for multi-arch image creation. Follow the instructions on that page to get a working builder instance for creating these containers. Some other useful build options follow.
+
+Build local images for the machine architecture and make them available in `docker images`:
+```
+make local
+```
+
+Build the Ubuntu image and export it to `docker images`:
+```
+make ubuntu20.04 PLATFORMS=linux/amd64 OUTPUT=type=docker
+```
+
+Build the images and push them to another registry, `<private_registry>`:
+```
+make REGISTRY=<private_registry> push
+```
+
+## Issues and Contributing
+
+[Check out the Contributing document!](CONTRIBUTING.md)
+
+* Please report problems by [filing a new issue](https://github.com/NVIDIA/dcgm-exporter/issues/new)
+* You can contribute by opening a [pull request](https://github.com/NVIDIA/dcgm-exporter)
+
+### Reporting Security Issues
+
+We ask that all community members and users of DCGM Exporter follow the standard NVIDIA process for reporting security vulnerabilities. This process is documented at the [NVIDIA Product Security](https://www.nvidia.com/en-us/security/) website.
+Following the process will result in any needed CVE being created as well as appropriate notifications being communicated
+to the entire DCGM Exporter community. NVIDIA reserves the right to delete vulnerability reports until they're fixed.
+
+Please refer to the policies listed there to answer questions related to reporting security issues.