Instrumenting a Slurm batch script for GPU monitoring¶
Rather than sampling GPU metrics interactively, you can instrument your Slurm submission script to collect statistics automatically for the full duration of the job. Add the following near the top of your script, after your #SBATCH directives:
```bash
# Start collecting GPU stats in the background
nvidia-smi \
    --query-gpu=timestamp,uuid,clocks_throttle_reasons.sw_thermal_slowdown,\
utilization.gpu,utilization.memory,memory.used,memory.total,\
temperature.gpu,power.draw,clocks.current.sm \
    --format=csv,nounits \
    -l 5 \
    -f gpu-stats-${SLURM_JOB_ID}.csv &
```
The `-l 5` flag polls every 5 seconds and writes each sample as a CSV row to `gpu-stats-${SLURM_JOB_ID}.csv`. Because the process is launched with `&`, it runs in the background alongside your workload and is terminated automatically when the job ends.
The fields collected are:
| Field | Description |
|---|---|
| `timestamp` | Wall-clock time of the sample |
| `uuid` | GPU device UUID |
| `clocks_throttle_reasons.sw_thermal_slowdown` | Whether the GPU is throttling due to temperature |
| `utilization.gpu` | GPU utilisation (%); see note below |
| `utilization.memory` | Memory bus utilisation (%); see note below |
| `memory.used` / `memory.total` | Device memory in MiB |
| `temperature.gpu` | GPU die temperature in °C |
| `power.draw` | Instantaneous power draw in W |
| `clocks.current.sm` | SM clock frequency in MHz |
Sample Slurm script¶
```bash
#!/bin/bash
#SBATCH --job-name gpu-burn
#SBATCH --cpus-per-task 2
#SBATCH --partition gpu_interactive
#SBATCH --gpus-per-node 1
#SBATCH --mem 4G
#SBATCH --time 00:10:00
#SBATCH --output slog/%j.out

module purge
module load CUDA/12.6.0 GCC/12.3.0

# start collecting GPU stats in the background
nvidia-smi --query-gpu=timestamp,uuid,clocks_throttle_reasons.sw_thermal_slowdown,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw,clocks.current.sm \
    --format=csv,nounits \
    -l 5 -f gpu-stats-${SLURM_JOB_ID}.csv &

./gpu_burn -tc -m 80% 600
```
Visualising the GPU stats CSV¶
Once your job completes, the `gpu-stats-${SLURM_JOB_ID}.csv` file can be passed to the plotting script to generate an interactive HTML report:
- The visualisation script can be found here
- Edit the filename on line 12 of the script to point at your CSV file, e.g. `fn = "gpu-stats-15140945.csv"`
This produces a self-contained HTML file with time-series plots for utilisation, memory usage, temperature, power draw, and SM clock frequency across the full job duration. An example output is embedded below.
What "utilisation" actually means¶
Take care when interpreting the utilization.gpu and utilization.memory columns. Per the nvidia-smi documentation, these are time-based metrics, not capacity-based:
- GPU utilisation — the percentage of the sample period during which at least one kernel was executing on the GPU.
- Memory utilisation — the percentage of the sample period during which global device memory was being read or written.
This means a single small kernel that runs continuously but uses only a tiny fraction of the GPU's compute capacity will still report 100% GPU utilisation. High utilisation is therefore a necessary but not sufficient indicator of efficient GPU use. If you see high utilisation alongside low throughput, the likely cause is a kernel that is poorly optimised for the hardware — a profiler such as Nsight Systems will give you the next level of detail.
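The time-based definition can be made concrete with a small sketch (purely illustrative; this is not how nvidia-smi computes the metric internally). The reported figure is the fraction of the sample window covered by at least one kernel's execution, regardless of how much of the GPU each kernel occupies:

```python
def gpu_utilisation(kernel_intervals, window_start, window_end):
    """Fraction of the sample window during which >= 1 kernel was running.

    kernel_intervals: list of (start, end) execution times, in seconds.
    Mirrors the time-based definition of utilization.gpu: a kernel that
    keeps a single SM busy counts the same as one saturating every SM.
    """
    # Clip intervals to the window, then merge overlapping ones
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in kernel_intervals
        if e > window_start and s < window_end
    )
    busy = 0.0
    cur_start, cur_end = None, None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (window_end - window_start)
```

A single tiny kernel running for the whole 5-second window, `gpu_utilisation([(0, 5)], 0, 5)`, reports 1.0 (100%) even if it uses one SM, which is exactly why high `utilization.gpu` alone does not imply efficient use of the hardware.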