A local LLM workstation is a different machine than a gaming PC. Different load profiles, different cooling, different decisions about RAM/CPU/PSU. Clients who build "a PC for AI" off a gaming catalog often end up with a machine that doesn't deliver what it should.
Mistake 1 — RAM too small, but fast
For an LLM workstation you need RAM as a buffer for the KV cache, for model weights that overflow VRAM, and for embedding pipelines. 32 GB is too little. 64 GB is often the minimum. 128 GB is a reasonable starting point if you also plan to run fine-tuning and inference in parallel.
Frequency? Secondary. The difference between DDR5-4800 and DDR5-6400 is 2–4% in real LLM workloads. The difference between 64 GB and 128 GB is the difference between "works" and "doesn't work" when loading a 70B model with 32k context.
Rule: capacity > frequency. Always.
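To make the capacity argument concrete, here is a rough back-of-the-envelope sketch. It assumes a 70B model with a Llama-2-70B-like shape (80 layers, 8 KV heads, head dimension 128, FP16 KV cache), a single 24 GB GPU, and that everything that does not fit in VRAM has to sit in system RAM. These figures are illustrative assumptions, not measurements, and real runtimes add their own overhead.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight size in GB (1 GB = 1e9 bytes): parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: a key and a value vector per layer and token."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


def ram_spill_gb(total_gb: float, vram_gb: float) -> float:
    """Whatever does not fit in VRAM must be buffered in system RAM."""
    return max(0.0, total_gb - vram_gb)


# Assumed model shape and hardware (hypothetical example values).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 32_768
VRAM_GB = 24

kv = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
for label, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    total = weights_gb(70, bytes_per_param) + kv
    spill = ram_spill_gb(total, VRAM_GB)
    print(f"{label}: total {total:.1f} GB, spills {spill:.1f} GB into system RAM")
```

Under these assumptions the 8-bit case spills roughly 57 GB into system RAM, which leaves almost nothing for the OS and other processes on a 64 GB machine and is comfortable on 128 GB.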
Mistake 2 — PSU sized "exactly"
A GPU for AI inference (RTX 4090, RTX A6000, A100, H100) has a TDP of roughly 250–700 W, depending on the model and form factor. The whole PC under full LLM load draws 600–1,200 W. The client buys an 850 W PSU because "the calculator said 800 W".
A PSU is typically most efficient at around 50% load. At 95% load, efficiency drops, the PSU runs hotter, and its lifespan shortens. A PSU held at 95% load will annoy you around the clock with its fan noise and is likely to come back as a defect after about two years.
Rule: size the PSU to 130% of expected peak draw. An 800 W calculator result → 1,040 W target → 1,200 W PSU, the next common size up. Small price difference, significant lifespan difference.
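The 130% rule can be sketched as a small helper. The list of PSU sizes is an assumption about common retail wattages, not a standard, so adjust it to what your supplier actually stocks.

```python
# Assumed list of common retail PSU sizes in watts (hypothetical, adjust as needed).
STANDARD_PSU_WATTS = [650, 750, 850, 1000, 1200, 1300, 1500, 1600]


def recommend_psu(peak_draw_w: float, headroom: float = 1.30) -> int:
    """Return the smallest listed PSU that covers peak draw plus headroom."""
    target = peak_draw_w * headroom
    for size in STANDARD_PSU_WATTS:
        if size >= target:
            return size
    raise ValueError(f"No listed PSU covers {target:.0f} W; consider a dual-PSU setup")


def load_fraction(peak_draw_w: float, psu_w: int) -> float:
    """Share of the PSU's rated output used at peak draw."""
    return peak_draw_w / psu_w


peak = 800  # watts, as reported by a PSU calculator
psu = recommend_psu(peak)
print(f"{peak} W peak -> {psu} W PSU, {load_fraction(peak, psu):.0%} load at peak")
```

With an 800 W peak this picks a 1,200 W unit, which runs at about two thirds of its rating at peak instead of 94% on an 850 W unit.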
Mistake 3 — Air cooling on a GPU that stays in 24/7 operation
An RTX 4090 in a gaming use case runs maybe 2 hours a day and sits idle more than 90% of the time. Air cooling is enough.
An RTX 4090 as a local LLM inference endpoint runs 24/7, often at 60–90% GPU utilization. Air cooling under this profile means:
- Higher operating temperatures (>80 °C continuous)
- Noise of 50–60 dB (disruptive in open offices)
- Throttling when ambient temperature exceeds 28 °C
Liquid cooling (AIO 360 mm minimum, ideally a custom loop) under 24/7 LLM workloads brings temperatures down to 60–70 °C, noise to 35–40 dB, and eliminates throttling.
Rule: if the GPU is planned for > 8 h per day, liquid cooling. Always.
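Before deciding, it helps to measure what the card actually does under your load. The sketch below polls `nvidia-smi` for temperature and utilization and flags GPUs above the 80 °C mark mentioned above. It assumes an NVIDIA driver with `nvidia-smi` on the PATH; the function names and the 80 °C default are illustrative choices.

```python
import subprocess

# Query fields supported by nvidia-smi; output is one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse lines such as '0, 83, 92' into dicts with integer fields."""
    readings = []
    for line in text.strip().splitlines():
        index, temp_c, util_pct = (field.strip() for field in line.split(","))
        readings.append(
            {"index": int(index), "temp_c": int(temp_c), "util_pct": int(util_pct)}
        )
    return readings


def read_gpus() -> list[dict]:
    """Run nvidia-smi once and return the current reading for every GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def too_hot(readings: list[dict], limit_c: int = 80) -> list[dict]:
    """Return the readings whose temperature exceeds the limit."""
    return [r for r in readings if r["temp_c"] > limit_c]
```

Calling `too_hot(read_gpus())` from a cron job or a loop over a full working day gives a simple record of how often the card sits above the limit under the real workload.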
Mistake 4 — NVMe as boot disk, but data on HDD
Model weights for a 70B model take 40–140 GB (roughly 40 GB at 4-bit quantization, 140 GB at FP16). Loading from HDD takes 5–10 minutes. Loading from NVMe (Gen 4) takes about 30 seconds.
During development, when you restart the server multiple times a day, 9 minutes × 5 = 45 minutes of lost time per day. Over 20 working days that is ~15 hours a month. A 2 TB NVMe in 2026 costs around €130. At typical engineering hourly rates it pays for itself in about 2 days of work.
Rule: model weights MUST be on NVMe Gen 4 or better. HDD is only for the offline-backup model archive.
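The arithmetic behind this can be sketched in a few lines. The sequential throughput figures (0.2 GB/s for an HDD, 5 GB/s for a Gen 4 NVMe) and the €90 hourly rate are assumptions for illustration; plug in your own drive benchmarks and rates.

```python
def load_seconds(size_gb: float, throughput_gb_s: float) -> float:
    """Best-case sequential read time for a file of the given size."""
    return size_gb / throughput_gb_s


def minutes_saved_per_day(slow_s: float, fast_s: float, restarts_per_day: int) -> float:
    """Waiting time saved per day by the faster drive."""
    return (slow_s - fast_s) * restarts_per_day / 60


def payback_days(drive_cost_eur: float, saved_min_per_day: float,
                 hourly_rate_eur: float) -> float:
    """Working days until the saved time is worth the price of the drive."""
    return drive_cost_eur / (saved_min_per_day / 60 * hourly_rate_eur)


# Assumed throughputs (GB/s) and hourly rate (EUR): hypothetical example values.
HDD_GB_S, NVME_GB_S, RATE_EUR = 0.2, 5.0, 90

print(f"140 GB from HDD:  {load_seconds(140, HDD_GB_S) / 60:.1f} min")
print(f"140 GB from NVMe: {load_seconds(140, NVME_GB_S):.0f} s")

# Figures from the text: 9 min per HDD load, 30 s per NVMe load, 5 restarts a day.
saved = minutes_saved_per_day(9 * 60, 30, 5)
print(f"Saved per day: {saved:.1f} min")
print(f"Payback on a 130 EUR drive: {payback_days(130, saved, RATE_EUR):.1f} days")
```

Subtracting the NVMe load time brings the saving to 42.5 minutes a day rather than the full 45, which still puts the payback at about two working days under the assumed rate.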
Mistake 5 — Single GPU, single point of failure
For a serious LLM workload you don't size with a single GPU. Reasons:
- If the GPU fails, the whole server goes down. A replacement can take days to arrive.
- During a firmware or driver update, the whole server is down while you test.
- Models larger than about 13B don't fit into the 24 GB of a single consumer GPU at FP16, and squeezing them in through heavy quantization or CPU offloading costs quality or speed.

A dual GPU setup (2× RTX 4090 over PCIe, or 2× RTX A6000 with an NVLink bridge; the RTX 4090 has no NVLink connector) enables:
- Tensor parallelism for larger models
- Failover to the remaining card when one fails
- Continuous A/B testing of different models
Cost: ~2× GPU + better mainboard with 2× PCIe 4.0 x16. Differential ~€3,000–5,000. For an operation that depends on availability, it amortizes in weeks.
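A quick way to sanity-check which models a one- or two-card setup can hold is to compare weight size against usable VRAM. The sketch below assumes weights are split evenly across the cards and that 20% of VRAM is reserved for KV cache, activations, and runtime overhead. Both are simplifying assumptions, and actual fit depends on the inference runtime.

```python
def fits_in_vram(params_billion: float, bytes_per_param: float,
                 gpus: int, vram_per_gpu_gb: float,
                 reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit into the VRAM left after the reserved share."""
    weights_gb = params_billion * bytes_per_param
    usable_gb = gpus * vram_per_gpu_gb * (1 - reserve_fraction)
    return weights_gb <= usable_gb


# (label, parameters in billions, bytes per parameter)
MODELS = [("13B FP16", 13, 2.0), ("70B 4-bit", 70, 0.5), ("70B 8-bit", 70, 1.0)]
# (label, number of GPUs, VRAM per GPU in GB)
SETUPS = [("1x 24 GB", 1, 24), ("2x 24 GB", 2, 24), ("2x 48 GB", 2, 48)]

for model, params, bpp in MODELS:
    for setup, gpus, vram in SETUPS:
        verdict = "fits" if fits_in_vram(params, bpp, gpus, vram) else "does not fit"
        print(f"{model} on {setup}: {verdict}")
```

Under these assumptions a 4-bit 70B model does not fit on one 24 GB card but does fit across two, which is the practical case for tensor parallelism on consumer hardware.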
Conclusion
An LLM workstation is not "a gaming PC with a better GPU". It is a dedicated workload server that deserves a dedicated spec — RAM capacity before frequency, PSU with headroom, liquid cooling, NVMe for data, dual GPU for resilience.
If you're building your own stack, get these 5 things into the first design. If you add them later, it costs double.
---
*We hold this discipline on every AI hardware build we deliver. We can walk through a hardware spec list for your specific use case in 30 minutes on a call — usually a single line about the goal (local inference, fine-tune, or both) is enough to choose between 3 reference configurations.*