
Instrumenting a Slurm batch script for GPU monitoring

Rather than sampling GPU metrics interactively, you can instrument your Slurm submission script to collect statistics automatically for the full duration of the job. Add the following near the top of your script, after your #SBATCH directives:

# Start collecting GPU stats in the background
nvidia-smi \
  --query-gpu=timestamp,uuid,clocks_throttle_reasons.sw_thermal_slowdown,\
utilization.gpu,utilization.memory,memory.used,memory.total,\
temperature.gpu,power.draw,clocks.current.sm \
  --format=csv,nounits \
  -l 5 \
  -f gpu-stats-${SLURM_JOB_ID}.csv &

The -l 5 flag polls every 5 seconds, and -f writes each sample as a CSV row to gpu-stats-${SLURM_JOB_ID}.csv. Because the process is launched with &, it runs in the background alongside your workload and is terminated automatically when the job ends.
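
If you prefer to shut the monitor down deterministically rather than relying on Slurm's end-of-job cleanup, you can record its PID and kill it from an exit trap. A minimal sketch (the nvidia-smi query string is abbreviated here; use the full one from above):

```shell
# Start the monitor in the background and remember its PID
nvidia-smi --query-gpu=timestamp,utilization.gpu \
  --format=csv,nounits -l 5 -f gpu-stats-${SLURM_JOB_ID}.csv &
MONITOR_PID=$!

# Stop the monitor whenever the script exits, however it exits
trap 'kill "$MONITOR_PID" 2>/dev/null' EXIT

# ... your workload runs here ...
```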

The fields collected are:

Field                                        Description
timestamp                                    Wall-clock time of the sample
uuid                                         GPU device UUID
clocks_throttle_reasons.sw_thermal_slowdown  Whether software thermal slowdown is active, i.e. the GPU is throttling due to temperature
utilization.gpu                              GPU utilisation (%) — see note below
utilization.memory                           Memory bus utilisation (%) — see note below
memory.used / memory.total                   Device memory used and total, in MiB
temperature.gpu                              GPU die temperature in °C
power.draw                                   Instantaneous power draw in W
clocks.current.sm                            SM clock frequency in MHz
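
For a quick command-line summary before any plotting, the CSV can be reduced with awk. A sketch, assuming the column order above (column 3 is the throttle reason, column 4 utilization.gpu, column 8 temperature.gpu) and relying on nvidia-smi's ", " field delimiter:

```shell
awk -F', ' 'NR > 1 {
    util_sum += $4                      # utilization.gpu
    if ($8 > max_temp) max_temp = $8    # temperature.gpu
    if ($3 == "Active") throttled++     # sw_thermal_slowdown
    n++
}
END {
    printf "samples: %d  mean GPU util: %.1f%%  max temp: %d C  throttled samples: %d\n",
           n, util_sum / n, max_temp, throttled + 0
}' gpu-stats-${SLURM_JOB_ID}.csv
```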

Sample Slurm script

#!/bin/bash 

#SBATCH --job-name      gpu-burn
#SBATCH --cpus-per-task 2
#SBATCH --partition     gpu_interactive
#SBATCH --gpus-per-node 1
#SBATCH --mem           4G
#SBATCH --time          00:10:00
#SBATCH --output        slog/%j.out


module purge 
module load CUDA/12.6.0 GCC/12.3.0 

# start collecting GPU stats in the background 
nvidia-smi --query-gpu=timestamp,uuid,clocks_throttle_reasons.sw_thermal_slowdown,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw,clocks.current.sm \
 --format=csv,nounits \
 -l 5 -f gpu-stats-${SLURM_JOB_ID}.csv & 

./gpu_burn -tc -m 80% 600
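
The script above is submitted with sbatch as usual. One gotcha: Slurm will not create the slog/ directory named in the --output directive, so make sure it exists before submitting. A sketch, assuming the script is saved as gpu-burn.sl (the filename here is illustrative):

```shell
mkdir -p slog          # Slurm does not create the --output directory for you
sbatch gpu-burn.sl
```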


Visualising the GPU stats CSV

Once your job completes, the gpu-stats-${SLURM_JOB_ID}.csv file can be passed to the plotting script to generate an interactive HTML report:

  • The visualisation script can be found here
  • Edit the filename on line 12 of the script to point at your own CSV, e.g. fn = "gpu-stats-15140945.csv"

This produces a self-contained HTML file with time-series plots for utilisation, memory usage, temperature, power draw, and SM clock frequency across the full job duration. An example output is embedded below.

What "utilisation" actually means

Take care when interpreting the utilization.gpu and utilization.memory columns. Per the nvidia-smi documentation, these are time-based metrics, not capacity-based:

  • GPU utilisation — the percentage of the sample period during which at least one kernel was executing on the GPU.
  • Memory utilisation — the percentage of the sample period during which global device memory was being read or written.

This means a single small kernel that runs continuously but uses only a tiny fraction of the GPU's compute capacity will still report 100% GPU utilisation. High utilisation is therefore a necessary but not sufficient indicator of efficient GPU use. If you see high utilisation alongside low throughput, the likely cause is a kernel that is poorly optimised for the hardware — a profiler such as Nsight Systems will give you the next level of detail.