Choosing a GPU for your workload
The BMRC cluster has several GPU types across different generations. Picking the right one upfront avoids two common problems: a job queuing for hours for a high-end GPU when a smaller one would have been sufficient, and a job failing mid-run because it ran out of memory on a GPU that was too small.
GPU types available
| GPU | Memory | Generation | Notes |
|---|---|---|---|
| `p100-pcie-16gb` | 16 GB HBM2 | Pascal (2016) | Oldest generation, shortest queues |
| `p100-sxm2-16gb` | 16 GB HBM2 | Pascal (2016) | Higher memory bandwidth than PCIe variant |
| `v100-pcie-16gb` | 16 GB HBM2 | Volta (2017) | Tensor Cores, good general-purpose training |
| `v100-pcie-32gb` | 32 GB HBM2 | Volta (2017) | Same as above with more headroom for larger models |
| `quadro-rtx6000` | 24 GB GDDR6 | Turing (2018) | Turing Tensor Cores, RT cores (not relevant for HPC) |
| `quadro-rtx8000` | 48 GB GDDR6 | Turing (2018) | Large memory, good for models that don't fit on 24 GB |
| `a100-pcie-40gb` | 40 GB HBM2e | Ampere (2020) | Significant leap in training throughput, BF16 support |
| `a100-pcie-80gb` | 80 GB HBM2e | Ampere (2020) | Best choice for large models; NVLink-capable |
| `l4` | 24 GB GDDR6 | Ada Lovelace (2023) | Efficient inference and fine-tuning, lower power draw |
| `GH200` | 96 GB HBM3 | Hopper (2023) | Grace Hopper Superchip; unified CPU+GPU memory; highest throughput |
How to choose
- Memory is usually the binding constraint. Estimate your model's memory footprint first. As a rough rule of thumb, the weights of a model with N parameters need 4×N bytes in float32 and around 2×N bytes in half precision (float16/bfloat16), before accounting for gradients, activations and optimiser states during training. For example, a 7-billion-parameter model needs about 28 GB in float32 and about 14 GB in bfloat16 for the weights alone. If your whole job comfortably fits in 16 GB, there is usually little benefit in queuing for an A100.
- For training from scratch or large-scale fine-tuning, prefer the A100 variants. The Ampere generation introduced native BF16 Tensor Core support, which modern frameworks (PyTorch ≥ 1.10, TensorFlow ≥ 2.7) can use through their mixed-precision modes. BF16 has to be enabled in your training code, but once it is, it typically gives substantially higher throughput than FP16 on earlier hardware.
- For inference or lightweight fine-tuning (e.g. LoRA on a model that fits in 24 GB), the L4 is a strong choice: it is more power-efficient and typically has shorter queues than the A100s.
- For very large workloads that need more than 40 GB, such as large language models or full-precision fine-tuning of foundation models, the A100-80GB and the GH200 are the appropriate targets. The GH200's unified CPU–GPU memory architecture is particularly well suited to workloads that stream large datasets, though it is a shared and limited resource.
- If your job is not limited by GPU memory capacity and primarily needs CUDA parallelism, the V100s are a practical middle ground with good software support and reasonable availability.
- When in doubt, start small. Run a short test job on a P100 or V100 with a reduced batch size to confirm your code runs correctly and get a rough sense of memory usage before committing to a longer run on a higher-tier GPU.
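The memory rule of thumb above can be turned into a quick estimate. The following is a minimal sketch, not an official BMRC tool: the function names `weights_gb` and `candidate_gpus` and the `overhead_factor` parameter are hypothetical, the memory sizes are copied from the table above, and 1 GB is assumed to be 10^9 bytes. The default `overhead_factor=1.0` counts weights only; for training you would need to supply your own multiplier for activations and optimiser states, ideally one measured from a short test run.

```python
# Bytes per parameter, following the rule of thumb above.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

# GPU memory in GB, copied from the "GPU types available" table.
GPU_MEMORY_GB = {
    "p100-pcie-16gb": 16,
    "p100-sxm2-16gb": 16,
    "v100-pcie-16gb": 16,
    "v100-pcie-32gb": 32,
    "quadro-rtx6000": 24,
    "quadro-rtx8000": 48,
    "a100-pcie-40gb": 40,
    "a100-pcie-80gb": 80,
    "l4": 24,
    "GH200": 96,
}


def weights_gb(n_params, dtype="float32"):
    """Memory needed for the weights alone, in GB (assumed 10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def candidate_gpus(n_params, dtype="float32", overhead_factor=1.0):
    """List GPU types whose memory covers the estimate, smallest first.

    overhead_factor is a hypothetical, user-supplied multiplier for
    everything beyond the weights; 1.0 means weights only.
    """
    needed = weights_gb(n_params, dtype) * overhead_factor
    fits = [(mem, name) for name, mem in GPU_MEMORY_GB.items() if mem >= needed]
    return [name for mem, name in sorted(fits)]


if __name__ == "__main__":
    n = 7_000_000_000  # a 7-billion-parameter model
    print(weights_gb(n, "float32"))   # 28.0
    print(weights_gb(n, "bfloat16"))  # 14.0
    print(candidate_gpus(n, "float32"))
```

The list is sorted by memory size only, so it says nothing about queue length or throughput; treat the first entries as the smallest devices worth testing on, and weigh them against the other considerations in the list above.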
