Choosing a GPU for your workload

The BMRC cluster has several GPU types across different generations. Picking the right one upfront avoids the common situation of a job queuing for hours to reach a high-end GPU when a smaller one would have been perfectly sufficient — or conversely, a job failing mid-run because it ran out of memory on an underpowered device.

GPU types available

| GPU | Memory | Generation | Notes |
| --- | --- | --- | --- |
| `p100-pcie-16gb` | 16 GB HBM2 | Pascal (2016) | Oldest generation, shortest queues |
| `p100-sxm2-16gb` | 16 GB HBM2 | Pascal (2016) | Higher memory bandwidth than PCIe variant |
| `v100-pcie-16gb` | 16 GB HBM2 | Volta (2017) | Tensor Cores, good general-purpose training |
| `v100-pcie-32gb` | 32 GB HBM2 | Volta (2017) | Same as above with more headroom for larger models |
| `quadro-rtx6000` | 24 GB GDDR6 | Turing (2018) | Turing Tensor Cores, RT cores (not relevant for HPC) |
| `quadro-rtx8000` | 48 GB GDDR6 | Turing (2018) | Large memory, good for models that don't fit on 24 GB |
| `a100-pcie-40gb` | 40 GB HBM2e | Ampere (2020) | Significant leap in training throughput, BF16 support |
| `a100-pcie-80gb` | 80 GB HBM2e | Ampere (2020) | Best choice for large models; NVLink-capable |
| `l4` | 24 GB GDDR6 | Ada Lovelace (2023) | Efficient inference and fine-tuning, lower power draw |
| `GH200` | 96 GB HBM3 | Hopper (2023) | Grace Hopper Superchip; unified CPU+GPU memory; highest throughput |

How to choose

Memory is usually the binding constraint, so estimate your model's memory footprint first. As a rough rule of thumb, the weights of a model with N parameters take 4×N bytes in float32 and 2×N bytes in float16 or bfloat16. Training needs considerably more than this: gradients, optimiser states and activations all come on top, and mixed-precision training usually keeps a float32 master copy of the weights as well. If your model comfortably fits in 16 GB, there is no benefit in queuing for an A100.
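
The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: the function names are hypothetical, 1 GB is taken as 10^9 bytes, and the training figure assumes plain float32 training with Adam (weights, gradients and two moment buffers, i.e. four float32 copies per parameter). Activations are deliberately left out because they depend on batch size and architecture.

```python
# Rough memory estimates from a parameter count.
# All multipliers are rule-of-thumb assumptions, not measurements.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weights_gb(n_params: int, dtype: str = "float32") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def adam_training_gb(n_params: int) -> float:
    """Weights + gradients + Adam's two moment buffers, all in float32.

    Assumed setup: 4 float32 copies per parameter = 16 bytes per parameter.
    Activations are NOT included.
    """
    return 4 * weights_gb(n_params, "float32")


if __name__ == "__main__":
    n = 1_300_000_000  # hypothetical 1.3B-parameter model
    print(f"float32 weights:  {weights_gb(n, 'float32'):.1f} GB")   # 5.2 GB
    print(f"bfloat16 weights: {weights_gb(n, 'bfloat16'):.1f} GB")  # 2.6 GB
    print(f"float32 + Adam:   {adam_training_gb(n):.1f} GB")        # 20.8 GB
```

In this hypothetical case the weights would fit on any GPU in the table, but training with Adam in float32 would already exceed a 16 GB device before activations are counted.
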
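For training from scratch or large-scale fine-tuning, prefer the A100 variants. The Ampere generation introduced native BF16 Tensor Core support, which modern frameworks (PyTorch ≥ 1.10, TensorFlow ≥ 2.7) can use once mixed precision is enabled, for example through `torch.autocast` in PyTorch. BF16 keeps the dynamic range of FP32, so it generally trains stably without the loss scaling that FP16 needs, and throughput is substantially higher than FP16 on earlier hardware.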
For inference or lightweight fine-tuning (e.g. LoRA on a model that fits in 24 GB), the L4 is a strong choice — it is more power-efficient and typically has shorter queues than the A100s.
For very large models that exceed 40 GB, such as large language models or full-precision fine-tuning of foundation models, target the A100-80GB or the GH200. The GH200's unified CPU–GPU memory architecture is particularly well suited to workloads that stream large datasets, though it is a shared and limited resource.
If GPU memory is not the limiting factor and your job mainly needs general CUDA compute, the V100s are a practical middle ground with good software support and reasonable availability.
When in doubt, start small. Run a short test job on a P100 or V100 with a reduced batch size to confirm your code runs correctly and get a rough sense of memory usage before committing to a longer run on a higher-tier GPU.
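
The guidance above amounts to "pick the smallest GPU that fits, with some margin". A minimal sketch of that idea follows. The memory sizes are copied from the table of GPU types; the function name and the 20% headroom factor are assumptions made for illustration, and the result says nothing about queue times or throughput.

```python
# GPU types and memory in GB, copied from the table of GPU types.
GPU_MEMORY_GB = {
    "p100-pcie-16gb": 16,
    "p100-sxm2-16gb": 16,
    "v100-pcie-16gb": 16,
    "v100-pcie-32gb": 32,
    "quadro-rtx6000": 24,
    "quadro-rtx8000": 48,
    "a100-pcie-40gb": 40,
    "a100-pcie-80gb": 80,
    "l4": 24,
    "GH200": 96,
}


def candidates(required_gb: float, headroom: float = 1.2) -> list[str]:
    """Return GPU types that can hold required_gb * headroom, smallest first.

    headroom=1.2 (20% margin) is an arbitrary illustrative default.
    Ties in memory size are ordered alphabetically by name.
    """
    needed = required_gb * headroom
    fits = [(mem, name) for name, mem in GPU_MEMORY_GB.items() if mem >= needed]
    return [name for mem, name in sorted(fits)]


if __name__ == "__main__":
    # A job estimated at 30 GB needs 36 GB with the assumed margin.
    print(candidates(30))
    # ['a100-pcie-40gb', 'quadro-rtx8000', 'a100-pcie-80gb', 'GH200']
```

An empty list means the estimate exceeds every single GPU in the table, in which case the model would need to be reduced or split across devices.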