NVIDIA H100 SXM5 draws 700 W on full FP16/BF16 training. B200 SXM follows at 1,000 W. GB200 in a full NVL72 rack reaches 132 kW. At rack scale these numbers are beyond what air cooling can handle, and the limit is not merely economic or a matter of efficiency: it is **physical**. This article is about what the decision looks like when you're standing in front of an 8-node H100 cluster or planning a first B200 deployment, and why two-phase immersion has died over the last 18 months.
Why air cooling stops above 30 kW/rack
Air cooling works as a chain: the heatsink draws heat from the chip through copper heat pipes into its fins, fan-driven air carries the heat off the fins, and the CRAC unit cools that air before it recirculates. The airflow required follows from the heat transfer equation:
`Q = ṁ × cp × ΔT`
Where `Q` is the heat to remove (W), `ṁ` is air mass flow (kg/s), `cp` is the specific heat capacity of air (1,005 J/kg·K) and `ΔT` is the difference between outlet and inlet air temperature.
For a 30 kW rack with ΔT = 15 K you need: `ṁ = 30,000 / (1,005 × 15) = 1.99 kg/s`, which at an air density of about 1.2 kg/m³ is ~1.66 m³/s, or roughly 5,970 m³/h of air.
That means fans in the rack + CRAC consuming **3.5–4.5 kW just for ventilation**. At 60 kW per rack you'd need double the air, about 11,900 m³/h, which already means 90+ dB noise and physical chassis limits (you have nowhere to put more fans). And the ventilation alone consumes 8–11 kW. **PUE for pure air at a 60 kW rack realistically climbs to 1.8–2.2**, which is economically unmanageable.
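The airflow arithmetic can be reproduced with a few lines of Python. This is a minimal sketch: the function name is illustrative, and the air density of 1.2 kg/m³ is an assumption (air at roughly 20 °C at sea level) that the text above does not state.

```python
AIR_CP_J_PER_KG_K = 1005.0   # specific heat of air, as given in the text
AIR_DENSITY_KG_M3 = 1.2      # assumption: air at roughly 20 °C, sea level


def air_flow_for_rack(heat_w: float, delta_t_k: float) -> tuple[float, float]:
    """Return (mass flow in kg/s, volume flow in m³/h) needed to remove heat_w."""
    mass_flow = heat_w / (AIR_CP_J_PER_KG_K * delta_t_k)       # Q = m_dot * cp * dT
    volume_flow_m3_h = mass_flow / AIR_DENSITY_KG_M3 * 3600.0  # kg/s -> m³/h
    return mass_flow, volume_flow_m3_h


for rack_kw in (30, 60):
    m_dot, v_dot = air_flow_for_rack(rack_kw * 1000.0, 15.0)
    print(f"{rack_kw} kW rack: {m_dot:.2f} kg/s, {v_dot:,.0f} m³/h")
```

Because the relation is linear, doubling the rack power at the same ΔT doubles the required airflow. The only way to reduce it is a larger ΔT, which the chip temperature limit constrains.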
The boundary layer at the chip-to-heatsink interface is another limit: H100 junction temperature must stay below 87 °C and the inlet air supplied by the CRAC is at most 27 °C, so the temperature budget across the whole path is 60 K. For a 700 W chip across a 7 cm² die area the heat flux is 100 W/cm². Air can't transfer this efficiently at reasonable speeds (3–6 m/s); above 50 W/cm² liquid cooling becomes necessary.
**Practical threshold:** above 30 kW/rack air cooling loses economic sense, above 50 kW/rack it loses physical sense.
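The per-chip numbers can be checked the same way. The sketch below computes the heat flux from the figures in the text and, as an added framing that the article itself does not use, the total junction-to-air thermal resistance that a 60 K budget allows at 700 W (`R = ΔT / P`). Function names are illustrative.

```python
def heat_flux_w_per_cm2(power_w: float, die_area_cm2: float) -> float:
    """Average heat flux over the die."""
    return power_w / die_area_cm2


def max_thermal_resistance_k_per_w(t_junction_max_c: float,
                                   t_inlet_c: float,
                                   power_w: float) -> float:
    """Largest total thermal resistance that keeps the junction under its limit."""
    return (t_junction_max_c - t_inlet_c) / power_w


flux = heat_flux_w_per_cm2(700.0, 7.0)                          # figures from the text
r_max = max_thermal_resistance_k_per_w(87.0, 27.0, 700.0)
print(f"heat flux: {flux:.0f} W/cm², allowed resistance: {r_max:.3f} K/W")
```

Everything between the die and the room air (thermal interface, heatsink or cold plate, and the fluid film) has to fit inside that resistance budget, which is why the choice of coolant matters at this power level.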
Direct-to-Chip (DTC) — how it actually works
DTC = a cold plate sits directly on each CPU and GPU, and liquid (typically a propylene glycol-water mix at 25:75 or 30:70) flows through microchannels in the cold plate. The liquid picks up heat and returns to the **CDU** (Coolant Distribution Unit), which passes the heat to a secondary loop, typically facility water that goes to a chiller or dry cooler.
Topology in a real 8-node H100 cluster
- **8× DGX H100** or **8× HPE Cray EX H100** — each node draws 10.2 kW (8× H100 SXM5 + 2× Sapphire Rapids CPU + DPU + NIC + PSU losses)
- **Rack TDP:** ~85 kW (8 nodes + 2× InfiniBand switch + storage chassis)
- **DTC coverage:** GPU SXM5 + CPU. NIC + DPU stay air-cooled (the residual 12–15% of node heat)
- **CDU per rack:** Asetek RackCDU D2C or CoolIT CHx650, capacity 100–150 kW per CDU unit
- **Secondary loop:** facility water 32–40 °C input → 45–55 °C output (W4 ASHRAE liquid cooling envelope)
- **Heat rejection:** dry cooler in the EU climate (no chiller needed at 32 °C+ water) — free cooling year-round on the right design
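The topology above implies a rough heat split and a facility water flow rate. The sketch below estimates both from the figures in the list. The water properties are assumptions (plain water at about 40 °C; a glycol mix has a lower specific heat and needs more flow), and the function names are illustrative.

```python
WATER_CP_J_PER_KG_K = 4186.0    # assumption: plain water; glycol mixes are lower
WATER_DENSITY_KG_PER_L = 0.99   # assumption: water at roughly 40 °C


def liquid_heat_kw(nodes: int, node_kw: float, residual_air_share: float) -> float:
    """Heat captured by the cold plates after subtracting the air-cooled residual."""
    return nodes * node_kw * (1.0 - residual_air_share)


def facility_flow_l_per_min(heat_kw: float, t_in_c: float, t_out_c: float) -> float:
    """Facility water flow needed to carry heat_kw between the given temperatures."""
    delta_t = t_out_c - t_in_c
    if delta_t <= 0:
        raise ValueError("outlet must be warmer than inlet")
    mass_flow_kg_s = heat_kw * 1000.0 / (WATER_CP_J_PER_KG_K * delta_t)
    return mass_flow_kg_s / WATER_DENSITY_KG_PER_L * 60.0


heat = liquid_heat_kw(nodes=8, node_kw=10.2, residual_air_share=0.15)
flow = facility_flow_l_per_min(heat, t_in_c=32.0, t_out_c=45.0)
print(f"{heat:.1f} kW to liquid, about {flow:.0f} L/min of facility water")
```

A vendor quote should state the same quantities explicitly. This estimate is only a sanity check on the order of magnitude.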
Top DTC vendors in 2026
**Asetek**
- RackCDU D2C generation 4, the broadest DTC ecosystem
- Cold plates for H100, B200, GB200, Intel Xeon, AMD EPYC
- CDU capacity 80/120/200 kW
- **Retrofit price:** 5,800–7,200 EUR / rack for cold plates + manifold + quick disconnects
- **CDU price:** 18–28k EUR per 120 kW unit

**CoolIT Systems**
- AHx series (Asetek-style) + CHx series (server-level integrated)
- For OEM (HPE Cray, Lenovo Neptune, Dell PowerEdge XE9680L)
- Stronger OEM integration, fewer retrofit kits
- **Price:** typically bundled in the OEM server quote, +3–4k EUR / server vs air variant

**Submer DTC (previously CoolIT Direct-to-Chip)**
- Originally an immersion vendor, now DTC products too
- Outdoor CDU variants with air-cooled rejection (no facility water required)
- **Price:** 6,500–8,000 EUR / rack

**Motivair**
- Specialised in high-density HPC retrofits
- ColdPort technology (for HPE Cray EX)
- **Price:** mostly project-based, 30–80k EUR per cluster
A real benchmark — 8-node DGX H100 cluster
In a project we audited (greenfield AI lab near Munich, fully deployed in Q4 2025):
| Parameter | Air-cooled DGX H100 | DTC Asetek retrofit |
|-----------|---------------------|----------------------|
| Server TDP per node | 10.2 kW | 10.2 kW |
| Cooling power per node | 1.4 kW (fans + CRAC share) | 0.28 kW (residual fans + CDU pump share) |
| **PUE of the whole cluster** | **1.45** | **1.08** |
| Annual consumption (8 nodes) | 715 MWh | 535 MWh |
| At 0.18 EUR/kWh | **128,700 EUR / year** | **96,300 EUR / year** |
| CAPEX delta | baseline | +52,000 EUR (8 racks DTC + CDU) |
| **Payback** | n/a | **~19 months** |
With B200 and B300 this benefit grows further (higher TDP → higher ratio of heat rejected through liquid vs. air).
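The payback figure in the table follows directly from the annual consumption, the energy price and the CAPEX delta. A minimal sketch of that arithmetic, using only the table's numbers (the function name is illustrative):

```python
def payback_months(capex_delta_eur: float,
                   baseline_mwh: float,
                   improved_mwh: float,
                   eur_per_kwh: float) -> float:
    """Simple payback: extra CAPEX divided by the annual energy cost saving."""
    annual_saving_eur = (baseline_mwh - improved_mwh) * 1000.0 * eur_per_kwh
    if annual_saving_eur <= 0:
        raise ValueError("no saving, payback is undefined")
    return capex_delta_eur / annual_saving_eur * 12.0


months = payback_months(capex_delta_eur=52_000, baseline_mwh=715,
                        improved_mwh=535, eur_per_kwh=0.18)
print(f"payback: {months:.1f} months")
```

This is a simple payback with no discounting, maintenance cost or energy price drift. Adding any of those would need inputs that the audit table does not provide.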
Immersion — single-phase reality
Single-phase immersion = the entire server (without fans) submerged in a dielectric fluid (Submer SmartCoolant, ShellLubri DCT 16, Castrol DC iX). Fluid flows through the tank, enters at 35–45 °C, leaves at 45–55 °C.
Capacity and PUE
- **Submer SmartPodX:** 100 kW per tank, footprint ~2 m²
- **Asperitas AIC24:** 50 kW per tank
- **GRC ICEraQ Quad:** 168 kW per quad-tank
- **PUE:** 1.03–1.06 (the best in industry)
Real-world limits
1. **Server form factor.** Not every server can be put in immersion. An NVIDIA HGX H100 8-GPU baseboard works, but fans must be removed and the thermal interface reapplied with an immersion-specific gap pad (ShellLubri SC2). Some OEMs (Supermicro, Inspur) offer immersion-ready variants; HPE Cray EX does not.
2. **Maintenance.** Pulling a server from the tank means: power down, wait 10–15 minutes for fluid drip-off, lift the server with a crane (typically 30–50 kg + 8–12 kg of fluid inside), move to the service bench. The operation takes 45–90 minutes instead of 5 minutes for air-cooled hot swap.
3. **Cabling.** Optical cables with a PVC jacket degrade in some fluids. LSZH (Low Smoke Zero Halogen) or PTFE jackets are required. Cabling cost surcharge 1.5–2× over standard.
4. **CAPEX:** 25–40k EUR / rack (tank + fluid + CDU secondary loop). For an 8-rack cluster that adds up to 200–320k EUR, to be weighed against the DTC quote.
When single-phase immersion wins
- **Extreme density.** GB200 NVL72 in greenfield — 132 kW in one NVL rack, DTC would need custom CDU sizing, immersion absorbs it natively.
- **Edge deployment with space constraint.** 200 kW IT load in a 20 ft container — air cooling doesn't fit, DTC fits but tightly packed, immersion is the most compact.
- **Greenfield with a 5+ year horizon.** CAPEX delta amortises through OPEX (PUE 1.05 vs 1.08).
For brownfield retrofit of an existing DC with air infrastructure, single-phase immersion is **almost always a bad choice** — form factor change + service disruption + cabling rebuild + maintenance retraining.
Two-phase immersion — why it died
Two-phase = the fluid **boils into vapour** on contact with the hot chip, the vapour condenses on a cooling coil above the tank, and the droplets fall back. It is the most efficient physical heat-transfer principle: passive, no pumps.
In 2020–2023 two-phase was considered state of the art: PUE 1.02, capacity 200–300 kW per tank, no mechanical motion in the primary loop. **3M Novec 7100, 7500, 649** were the flagship fluids: fluorinated, good thermal properties, environmentally "safe."
The reality 2024–2026:
- **December 2022:** 3M announced end of production of all PFAS (per- and polyfluoroalkyl substances) by end of 2025.
- **2023:** EU REACH proposal for PFAS restriction (over 10,000 chemicals, including Novec). The final restriction is expected to take effect 2026–2028.
- **2024:** Novec 7100 price rose from 65 EUR/kg to 180–220 EUR/kg, availability restricted to existing customers.
- **2025–2026:** No major vendor (Submer, Asperitas, GRC) sells new two-phase systems. Existing installations are maintained, but roadmaps are predominantly single-phase.
Replacement fluids (LiquidCool LCS-CF series, Engineered Fluids ElectroCool) are in pilot phase. For a production greenfield cluster in 2026 **two-phase isn't a realistic choice**, given the gaps in vendor support, the regulatory risk and the uncertain long-term fluid availability.
CDU sizing — the rule everyone fine-tunes
The CDU (Coolant Distribution Unit) is the heart of a DTC deployment. It contains the heat exchanger between the primary (server) loop and the secondary (facility water) loop, plus the pumps that drive the primary loop.
Rule of thumb
- **Per-rack CDU:** 1× CDU per rack on 50–100 kW racks. Single point of failure per rack, but a simple architecture. Example: Asetek RackCDU D2C 50.
- **Per-row CDU:** 1× CDU serves 4–6 racks, 200–500 kW total. Better economic scaling, but failure hits the whole row. Example: CoolIT CHx650.
- **Central CDU:** 1× CDU for the whole DC (1+ MW). Best economic scaling, but requires sophisticated plumbing with thousands of quick disconnects.
N+1 redundancy
For an AI training cluster running 24/7 where a lost checkpoint costs 8–24 hours of training time, CDU redundancy is **mandatory**. N+1 means: for a 100 kW load you have 2× 100 kW CDU in active-passive, or 3× 50 kW CDU in active-active load sharing.
CAPEX delta: +35–60% on cooling infrastructure. Payback comes with the first CDU pump failure, which typically occurs after ~5–7 years at baseline maintenance.
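The N+1 rule can be written as a small check: size for the load, add one unit, and confirm that the remaining units still carry the load after a single failure. This is a sketch with illustrative function names. It assumes identical CDU units and ignores pump curves and plumbing limits.

```python
import math


def cdu_units_n_plus_1(load_kw: float, unit_capacity_kw: float) -> int:
    """Number of identical CDUs for N+1: enough for the load, plus one spare."""
    n = math.ceil(load_kw / unit_capacity_kw)
    return n + 1


def survives_single_failure(load_kw: float, unit_capacity_kw: float, units: int) -> bool:
    """True if the load is still covered with one unit out of service."""
    return (units - 1) * unit_capacity_kw >= load_kw


for capacity in (100.0, 50.0):
    units = cdu_units_n_plus_1(100.0, capacity)
    ok = survives_single_failure(100.0, capacity, units)
    print(f"100 kW load, {capacity:.0f} kW units: {units} CDUs, survives failure: {ok}")
```

For the 100 kW example this yields two 100 kW units or three 50 kW units, matching the two configurations described above.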
Leak risk and insurance
The most common client concern: "what if fluid leaks onto the servers?"
The reality after 5 years of DTC deployments (data from two insurers that shared aggregated claims data for EU AI infrastructure):
- **Leak frequency:** 0.3–0.8 incidents per 1,000 rack-years
- **Damage per incident:** typically < 5% of the equipment (quick disconnect prevents catastrophic spill)
- **Mean repair time:** 2–6 hours (drain, replace coupling, refill, test)
For comparison: an air-cooled DC has its own failure modes (CRAC failure, condensate leak from evaporator coils, ventilation stop). Aggregated downtime over 5 years is comparable or lower on a properly designed DTC.
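To translate the leak frequency into something a risk register can use, the rate per 1,000 rack-years can be scaled to a specific fleet. A minimal sketch, assuming incidents are independent and the rate is constant over time (neither is stated in the insurer data); the fleet size in the example is hypothetical.

```python
def expected_leaks(racks: int, years: float, rate_per_1000_rack_years: float) -> float:
    """Expected number of leak incidents for a fleet over a period."""
    rack_years = racks * years
    return rack_years * rate_per_1000_rack_years / 1000.0


# Hypothetical fleet: 100 racks operated for 5 years, i.e. 500 rack-years.
low = expected_leaks(100, 5, 0.3)
high = expected_leaks(100, 5, 0.8)
print(f"expected incidents over 5 years: {low:.2f} to {high:.2f}")
```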
**Insurance:** Allianz, Munich Re, AXA have had DTC-specific policies since 2023. Premium delta vs air-cooled is ~3–8% in the EU in 2026 — sharply down from the 15–20% in 2020. Required: leak detection sensors (Aquasense, EcoFlux), automatic shut-off valves per rack, drip trays under the CDU, documented emergency response plan.
A 15-minute decision framework
1. **Which GPU and what density?**
   - H100 SXM5 single rack (8 nodes, ~85 kW) → DTC mandatory
   - B200 8-GPU baseboard (~120 kW per rack) → DTC or immersion
   - GB200 NVL36/NVL72 (132–192 kW per rack) → DTC with high-capacity CDU or single-phase immersion
2. **Brownfield retrofit or greenfield?**
   - Brownfield → DTC (existing servers can be retrofitted or replaced with DTC variants), no tank rebuild
   - Greenfield with a 5+ year horizon → consider immersion if density > 100 kW/rack
3. **What maintenance can the team handle?** DTC maintenance is similar to air-cooled (hot swap remains). Immersion needs 6–12 months of technical upskilling.
4. **What is the facility water input?** If you have a source < 35 °C (dry cooler in the EU climate, or a small chiller) → DTC is ideal. If you don't → you have to budget for a chiller plant CAPEX.
5. **What PUE target?** 1.08–1.12 → DTC. 1.03–1.06 → single-phase immersion (with higher CAPEX uplift).
6. **Two-phase immersion?** Out. Come back in 2 years if non-PFAS alternatives reach production maturity.
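Parts of the framework above can be condensed into a first-pass filter. The sketch below is a deliberate simplification, not the full decision. It encodes only the density, brownfield/greenfield, horizon and PUE questions, it reuses the 30 kW/rack air threshold from earlier in the article, and it leaves out team skills and facility water. The `Site` type and `recommend_cooling` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Site:
    rack_kw: float        # planned density per rack
    greenfield: bool      # False means brownfield retrofit
    horizon_years: float  # planning horizon
    pue_target: float     # target PUE for the cluster


def recommend_cooling(site: Site) -> str:
    """First-pass recommendation; two-phase immersion is never returned."""
    if site.rack_kw <= 30:
        return "air"  # below the practical threshold for air cooling
    if not site.greenfield:
        return "dtc"  # brownfield: retrofit, no tank rebuild
    wants_immersion_pue = site.pue_target < 1.08  # below what DTC typically reaches
    if site.rack_kw > 100 and site.horizon_years >= 5 and wants_immersion_pue:
        return "single-phase immersion"
    return "dtc"


print(recommend_cooling(Site(rack_kw=85, greenfield=False, horizon_years=3, pue_target=1.10)))
print(recommend_cooling(Site(rack_kw=132, greenfield=True, horizon_years=6, pue_target=1.05)))
```

Treat the result as a starting point for the tender discussion. Questions 3 and 4 can still overturn it.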
A practical tip in the tender process
Require the following in any AI cluster cooling quote:
- **Per-node thermal envelope:** GPU junction temp budget, CPU junction temp budget, residual air cooling for NIC/DPU
- **CDU sizing with 30% reserve** for future GPU upgrade (B300, R100)
- **Facility water specification:** input/output temperature, flow rate, water chemistry (pH, conductivity, biofouling protection)
- **Service runbook:** quick disconnect procedure, leak response, CDU pump failover test
- **Insurance + warranty:** how many leak incidents the vendor warranty covers, what insurance premium the supplier recommends
In a DGX H100 deployment audit in 2025 we found a client's supplier offering an 80 kW CDU for 85 kW racks. At full load (fine-tuning a 405B-parameter Llama model) the CDU ran at 106% capacity, water output rose from 50 °C to 62 °C, and GPU junction temperature climbed to 84 °C, just 3 °C below thermal throttle. Marginal. In a summer peak with warm facility water (39 °C input) it would have throttled. **A 30% CDU sizing reserve is non-negotiable.**
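The audit finding reduces to two numbers: the utilisation of the offered CDU, and the capacity a 30% reserve would have required. A minimal sketch with illustrative function names:

```python
def cdu_utilisation(rack_kw: float, cdu_kw: float) -> float:
    """Fraction of CDU capacity used at full rack load (1.0 means 100%)."""
    return rack_kw / cdu_kw


def min_cdu_capacity_kw(rack_kw: float, reserve: float = 0.30) -> float:
    """Smallest CDU capacity that leaves the given reserve above the rack load."""
    return rack_kw * (1.0 + reserve)


rack_kw, offered_cdu_kw = 85.0, 80.0
print(f"utilisation: {cdu_utilisation(rack_kw, offered_cdu_kw):.0%}")
print(f"capacity with 30% reserve: {min_cdu_capacity_kw(rack_kw):.1f} kW")
```

For the 85 kW rack in the audit, the reserve rule calls for about 110 kW of CDU capacity, which the 80 kW unit misses by a wide margin.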
---
*We do AI cluster design + cooling architecture for 8-node and larger deployments, from H100 through B200 to GB200. If you're planning a cluster above 500 kW IT load, the first design workshop (4 hours) walks through the DTC vs immersion decision for your specific build-out with numerical PUE and CAPEX comparison.*