Server GPU & AI Workstation Buyer's Guide
AI workstation builds are among the fastest-moving infrastructure categories right now. The right card today will be outclassed in 18 months. The wrong card runs hot, throttles under load, and never delivers its rated performance. This guide covers GPU selection by workload, VRAM and bandwidth math, multi-GPU scaling (NVLink, PCIe), cooling for high-TDP cards, and refurb vs new procurement.
Workload Classification — Inference, Fine-Tuning, Pre-Training
Inference workloads (BERT serving, Llama 7B-13B serving, image classification) run on a wide range of GPUs. Consumer RTX 4090 and RTX 5090 cards (24 GB and 32 GB VRAM respectively) handle most inference workloads cost-effectively. INT8 quantization halves the weight footprint relative to FP16 (a 4x reduction relative to FP32), so the same card can hold a model roughly twice as large.
Fine-tuning workloads (LoRA, full fine-tune of 7B-30B models) need 24-80 GB of VRAM. Suitable cards include the RTX 5090 (32 GB), RTX A6000 (48 GB), and H100 80GB. NVLink helps for multi-GPU fine-tunes.
Pre-training workloads (training a base model from scratch) require H100 / H200 / B200 in 8-card NVLink configurations. Out of scope for most enterprise deployments — this is hyperscaler / lab territory.
VRAM and Memory Bandwidth Math
LLM serving capacity scales with VRAM. Rule of thumb for the weights alone: model parameter count (billions) × bytes per parameter (FP16 = 2, INT8 = 1, INT4 = 0.5) = VRAM in GB. A Llama 70B model at INT8 needs ~70 GB; at INT4 it fits in ~35 GB. KV cache and runtime overhead come on top of these figures.
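The rule of thumb is simple enough to sketch in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names and the 20% headroom default are assumptions introduced here, and the estimate covers weights only.

```python
# Bytes per parameter for the precisions named in the rule of thumb.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Estimate VRAM for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, card_vram_gb: float,
         headroom_fraction: float = 0.2) -> bool:
    """True if the weights fit while a reserve is kept for KV cache and runtime.

    headroom_fraction is a hypothetical planning margin, not a vendor figure.
    """
    usable_gb = card_vram_gb * (1.0 - headroom_fraction)
    return weight_vram_gb(params_billions, precision) <= usable_gb


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        print(f"70B @ {precision}: {weight_vram_gb(70, precision):.0f} GB")
```

With the assumed 20% reserve, a 70B model at INT4 fits on a single 80 GB card, while the same model at INT8 does not.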
For multi-user inference serving, batch size matters. Each concurrent user adds KV-cache overhead, typically 0.5-2 GB per user depending on model architecture and context length. Plan VRAM headroom accordingly.
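The per-user figure can be derived from the model shape. The sketch below assumes a 7B-class model with 32 layers, 32 KV heads, and a head dimension of 128 stored in FP16; those shape values and all function names are illustrative assumptions, and models that use grouped-query attention will need far less.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: each layer stores one key vector and one value vector
    # per KV head for every token in the context.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache for one user at a given context length, in GiB."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim,
                                         bytes_per_value)
    return per_token * context_tokens / 2**30


def max_concurrent_users(free_vram_gib: float, per_user_gib: float) -> int:
    """How many full-length contexts fit in the VRAM left after weights."""
    return int(free_vram_gib // per_user_gib)


if __name__ == "__main__":
    # Assumed 7B-class shape; substitute the values from your model's config.
    shape = dict(layers=32, kv_heads=32, head_dim=128)
    for context in (1024, 2048, 4096):
        print(f"{context} tokens: {kv_cache_gib(context, **shape):.2f} GiB per user")
```

Under these assumptions the cache grows from 0.5 GiB at 1,024 tokens to 2 GiB at 4,096 tokens, which is where the 0.5-2 GB planning range comes from.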
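Memory bandwidth determines token throughput. The H100 SXM uses HBM3 (the H100 PCIe uses HBM2e), the H200 uses HBM3e, the RTX 4090 uses GDDR6X, and the RTX 5090 uses GDDR7. Bandwidth-bound workloads such as LLM inference benefit more from the H100's memory bandwidth than from its raw FLOPS uplift.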
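A rough way to see why bandwidth dominates: at batch size 1, generating each token requires reading every weight once, so memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below is a back-of-envelope bound, not a benchmark. The bandwidth values are published peak figures used here as assumptions, and real throughput will be lower.

```python
# Peak memory bandwidth in GB/s (published peak figures, assumed here;
# sustained bandwidth in practice is lower).
PEAK_BANDWIDTH_GB_S = {
    "H100 SXM": 3350.0,
    "RTX 4090": 1008.0,
}


def max_decode_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Assumes every generated token streams the full set of weights from VRAM
    once and ignores KV-cache reads, kernel overhead, and compute time.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    model_gb = 14.0  # 7B parameters at FP16, per the weights rule of thumb
    for card, bandwidth in PEAK_BANDWIDTH_GB_S.items():
        ceiling = max_decode_tokens_per_s(bandwidth, model_gb)
        print(f"{card}: at most ~{ceiling:.0f} tokens/s per stream")
```

Batching raises aggregate throughput because one pass over the weights serves several users, which is why the per-stream ceiling is only a starting point for capacity planning.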
Multi-GPU Scaling — NVLink, PCIe, NVSwitch
PCIe Gen4 x16 delivers roughly 32 GB/s in each direction (about 64 GB/s bidirectional). That is sufficient for inference serving where each GPU works on a self-contained batch.
NVLink Gen3 delivers 600 GB/s of peer-to-peer bandwidth per GPU on the A100. The RTX A6000 uses a two-card NVLink bridge rated at 112.5 GB/s. NVLink is needed in practice for fine-tuning that splits the model across cards.
NVLink Gen4 (H100 SXM) delivers 900 GB/s per GPU. NVSwitch (H100 in DGX / HGX systems) provides full-mesh GPU-to-GPU bandwidth.
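Consumer RTX cards (4090, 5090) do not have NVLink. Multi-card setups are best suited to data-parallel workloads where each card operates on its own batch independently; splitting a model across cards still works over PCIe, but at a small fraction of NVLink bandwidth.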
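To put the interconnect figures in perspective, the sketch below estimates how long it takes to move a fixed payload between two GPUs over each link. It is a simplification under stated assumptions: it uses peak link rates, ignores latency and protocol overhead, and the 14 GB payload (FP16 gradients for a 7B model) is a hypothetical example.

```python
# Peak link bandwidth in GB/s. The PCIe figure is per direction; the NVLink
# figures are per-GPU aggregates, so this comparison is indicative only.
LINK_GB_S = {
    "PCIe Gen4 x16": 32.0,
    "NVLink Gen3 (A100)": 600.0,
    "NVLink Gen4 (H100)": 900.0,
}


def transfer_seconds(payload_gb: float, link_gb_s: float) -> float:
    """Best-case time to move payload_gb across a link at its peak rate."""
    if link_gb_s <= 0:
        raise ValueError("link_gb_s must be positive")
    return payload_gb / link_gb_s


if __name__ == "__main__":
    payload_gb = 14.0  # hypothetical: FP16 gradients for a 7B-parameter model
    for link, bandwidth in LINK_GB_S.items():
        ms = transfer_seconds(payload_gb, bandwidth) * 1000
        print(f"{link}: ~{ms:.0f} ms per {payload_gb:.0f} GB exchange")
```

An exchange that repeats every training step makes the gap decisive. Data-parallel inference with no per-step exchange is largely unaffected by it.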
Cooling for High-TDP Cards
H100 SXM draws up to 700 W per card. 8-card HGX systems draw about 5.6 kW from GPUs alone, and total system draw approaches 10 kW.
RTX 5090 draws up to 575 W. A 4-card workstation build draws 2.3 kW from GPUs alone.
Rack power planning: confirm PDU capacity supports the GPU rack power draw without exceeding 80% of breaker rating.
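The 80% check is easy to script before ordering hardware. The sketch below is illustrative only: the 208 V / 30 A circuit is a hypothetical example, the function names are introduced here, and a real plan must add CPU, memory, fan, and power supply losses on top of GPU draw.

```python
def gpu_power_w(cards: int, tdp_w: float) -> float:
    """Total GPU draw at full TDP, in watts."""
    return cards * tdp_w


def circuit_budget_w(volts: float, breaker_amps: float,
                     derate: float = 0.8) -> float:
    """Continuous load allowed on a circuit at the given derating factor."""
    return volts * breaker_amps * derate


def circuit_ok(load_w: float, volts: float, breaker_amps: float,
               derate: float = 0.8) -> bool:
    """True if the load stays within the derated breaker capacity."""
    return load_w <= circuit_budget_w(volts, breaker_amps, derate)


if __name__ == "__main__":
    # Hypothetical single-phase 208 V / 30 A circuit.
    volts, amps = 208.0, 30.0
    builds = {"4x RTX 5090": gpu_power_w(4, 575),
              "8x H100 SXM": gpu_power_w(8, 700)}
    for name, load in builds.items():
        verdict = "within" if circuit_ok(load, volts, amps) else "over"
        print(f"{name}: {load:.0f} W GPU draw, {verdict} the 80% budget")
```

On the assumed circuit, the 4-card RTX 5090 build leaves room for the rest of the system, while an 8-card H100 SXM system exceeds the budget on GPU draw alone and needs a larger or additional feed.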
Cooling: dense GPU racks need hot aisle / cold aisle containment plus aisle airflow planning. Liquid-cooled options (NVIDIA HGX, custom water blocks) handle higher density but require chilled water infrastructure.
New vs Refurbished GPU Procurement
For tier-1 production AI inference, new sealed OEM is the only choice — manufacturer warranty matters when single-card failures take service offline.
For lab / dev / training infrastructure, refurbished H100 and A100 cards are available at 40-60% off new pricing. Pro Disk Network tests every refurbished GPU at full TDP for 4 hours before shipping.
Allocation matters: H100 and H200 supply remains constrained. Place orders 4-8 weeks ahead of need. RTX 5090 supply tightened in Q4 2025 — confirm availability before committing build dates.
Need help picking?
Pro Disk Network engineering can validate a specific configuration against your chassis, workload, and budget. Email sales@prodisknetwork.com with your server model and target spec. Response within one business day.