Instrumenting a Slurm batch script for GPU monitoring¶
Rather than sampling GPU metrics interactively, you can instrument your Slurm submission script to collect statistics automatically for the full duration of the job. Add the following near the top of your script, after your #SBATCH directives:
```bash
# Start collecting GPU stats in the background
nvidia-smi \
    --query-gpu=timestamp,uuid,clocks_throttle_reasons.sw_thermal_slowdown,\
utilization.gpu,utilization.memory,memory.used,memory.total,\
temperature.gpu,power.draw,clocks.current.sm \
    --format=csv,nounits \
    -l 5 \
    -f gpu-stats-${SLURM_JOB_ID}.csv &
```
The `-l 5` flag polls every 5 seconds and writes each sample as a CSV row to `gpu-stats-${SLURM_JOB_ID}.csv`. Because the process is launched with `&`, it runs in the background alongside your workload and is terminated automatically when the job ends.
The fields collected are:
| Field | Description |
|---|---|
| `timestamp` | Wall-clock time of the sample |
| `uuid` | GPU device UUID |
| `clocks_throttle_reasons.sw_thermal_slowdown` | Whether the GPU is throttling due to temperature |
| `utilization.gpu` | GPU utilisation (%); see note below |
| `utilization.memory` | Memory bus utilisation (%); see note below |
| `memory.used` / `memory.total` | Device memory in MiB |
| `temperature.gpu` | GPU die temperature in °C |
| `power.draw` | Instantaneous power draw in W |
| `clocks.current.sm` | SM clock frequency in MHz |
Sample Slurm script¶
```bash
#!/bin/bash
#SBATCH --job-name gpu-burn
#SBATCH --cpus-per-task 2
#SBATCH --partition gpu_interactive
#SBATCH --gpus-per-node 1
#SBATCH --mem 4G
#SBATCH --time 00:10:00
#SBATCH --output slog/%j.out

module purge
module load CUDA/12.6.0 GCC/12.3.0

# start collecting GPU stats in the background
nvidia-smi --query-gpu=timestamp,uuid,clocks_throttle_reasons.sw_thermal_slowdown,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw,clocks.current.sm \
    --format=csv,nounits \
    -l 5 -f gpu-stats-${SLURM_JOB_ID}.csv &

./gpu_burn -tc -m 80% 600
```
Visualising the GPU stats CSV¶
Once your job completes, the `gpu-stats-${SLURM_JOB_ID}.csv` file can be passed to the plotting script to generate an interactive HTML report:
- The visualisation script can be found here
- Edit the filename on line 12 of the script to point at your CSV file, e.g. `fn = "gpu-stats-15140945.csv"`
This produces a self-contained HTML file with time-series plots for utilisation, memory usage, temperature, power draw, and SM clock frequency across the full job duration. An example output is embedded below.
What "utilisation" actually means¶
Take care when interpreting the utilization.gpu and utilization.memory columns. Per the nvidia-smi documentation, these are time-based metrics, not capacity-based:
- GPU utilisation — the percentage of the sample period during which at least one kernel was executing on the GPU.
- Memory utilisation — the percentage of the sample period during which global device memory was being read or written.
This means a single small kernel that runs continuously but uses only a tiny fraction of the GPU's compute capacity will still report 100% GPU utilisation. High utilisation is therefore a necessary but not sufficient indicator of efficient GPU use. If you see high utilisation alongside low throughput, the likely cause is a kernel that is poorly optimised for the hardware — a profiler such as Nsight Systems will give you the next level of detail.
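The time-based definition can be made concrete with a small sketch (purely illustrative; this is not how nvidia-smi computes the metric internally). The reported figure is the fraction of the sample window covered by at least one kernel's execution, regardless of how much of the GPU each kernel occupies:

```python
def gpu_utilisation(kernel_intervals, window_start, window_end):
    """Fraction of the sample window during which >= 1 kernel was running.

    kernel_intervals: list of (start, end) execution times, in seconds.
    Mirrors the time-based definition of utilization.gpu: a kernel that
    keeps a single SM busy counts the same as one saturating every SM.
    """
    # Clip intervals to the window, then merge overlapping ones
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in kernel_intervals
        if e > window_start and s < window_end
    )
    busy = 0.0
    cur_start, cur_end = None, None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (window_end - window_start)
```

A single tiny kernel running for the whole 5-second window, `gpu_utilisation([(0, 5)], 0, 5)`, reports 1.0 (100%) even if it uses one SM, which is exactly why high `utilization.gpu` alone does not imply efficient use of the hardware.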