Summary comparison — compute, GPUs, and energy (assumptions: public reporting, 2024–2026 hardware)
Key assumptions used: Grok (xAI) trains on large Nvidia H100/H800-based Colossus clusters (dense training at exaflop scale in aggregate); DeepSeek (Whale Lab) uses Mixture-of-Experts (MoE) designs and, in some reports, non‑Nvidia accelerators (Huawei Ascend / Cambricon) alongside Nvidia H800-class parts. Numbers below are order‑of‑magnitude estimates synthesized from published technical notes, reporting, and community analyses.
Raw compute (training)
Grok / Colossus-style dense training:
Training uses dense compute, where every parameter contributes to every token. A frontier dense model in the 100B–6T class typically consumes tens to thousands of PFLOP/s‑years of total compute (sustained PFLOP/s × years). Example scale: several million to tens of millions of GPU‑hours on H100-class GPUs for the largest runs.
DeepSeek / MoE:
MoE greatly reduces FLOPs per token because only a subset of experts is activated for each token. Reported: DeepSeek‑V3 ~250 GFLOPs/token vs 2448 GFLOPs/token for a 405B dense model (paper claim), roughly a 10× gap. Reported GPU‑hour totals for DeepSeek‑V3 training are about an order of magnitude lower than comparable dense runs (papers/reporting cite low single‑digit million GPU‑hours vs tens of millions for some dense baselines).
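As a quick check on the per-token figures quoted above, here is a minimal sketch. It uses only the two reported numbers; the token count is a purely hypothetical value chosen for illustration, not a figure from any paper.

```python
# Reported per-token compute, as quoted in the text above (not measured here).
MOE_GFLOPS_PER_TOKEN = 250.0      # DeepSeek-V3, reported
DENSE_GFLOPS_PER_TOKEN = 2448.0   # 405B dense model, reported


def compute_ratio(dense: float, moe: float) -> float:
    """Return how many times larger the dense figure is than the MoE figure."""
    return dense / moe


def total_training_flops(gflops_per_token: float, n_tokens: float) -> float:
    """Total FLOPs for a run, assuming constant per-token cost (a simplification)."""
    return gflops_per_token * 1e9 * n_tokens


ratio = compute_ratio(DENSE_GFLOPS_PER_TOKEN, MOE_GFLOPS_PER_TOKEN)

# HYPOTHETICAL token count, for illustration only.
EXAMPLE_TOKENS = 1e12
moe_total = total_training_flops(MOE_GFLOPS_PER_TOKEN, EXAMPLE_TOKENS)
dense_total = total_training_flops(DENSE_GFLOPS_PER_TOKEN, EXAMPLE_TOKENS)

print(f"dense / MoE FLOPs per token: {ratio:.1f}x")
print(f"MoE total:   {moe_total:.3e} FLOPs")
print(f"dense total: {dense_total:.3e} FLOPs")
```

The ratio stays the same whatever token count is assumed, since both totals scale linearly with tokens.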
Types of GPUs / accelerators
Grok / xAI:
Heavily Nvidia (H100/H800 family) with NVLink / NVSwitch for intra‑node high bandwidth. GPUs optimized for dense tensor compute and large memory bandwidth.
DeepSeek:
Uses MoE-friendly deployments; Huawei Ascend-family and H800-style accelerators are reported in some deployments. MoE benefits from high interconnect bandwidth but can be optimized to reduce InfiniBand (IB) traffic (node‑limited routing); with a suitable inference engine and quantization it can also run on mixed hardware, including lower‑cost consumer GPUs for inference.
GPU counts and cluster design
Dense (Grok) clusters:
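Very large single‑site clusters (reports of 1–1.5 GW datacenter power footprints for Colossus‑class installs). At roughly 1–1.5 kW of facility power per GPU (an assumed all-in figure covering host, networking, and cooling), that implies hundreds of thousands of H100/H800 GPUs for frontier training and large on‑demand inference capacity.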
MoE (DeepSeek) clusters:
Fewer effective GPU‑hours are required for equivalent capability. MoE still requires many GPUs for parameter storage and routing at scale, but can reach similar performance with fewer active FLOPs and specialized routing that reduces cross‑node bandwidth. Reports estimate that training DeepSeek‑V3 required a few million GPU‑hours on H800‑class gear (much lower than some dense baselines).
Electricity and power costs (training)
Dense (Grok):
If a Colossus facility is 1–1.5 GW peak, annual electricity for continuous operation is enormous (GW × hours × $/kWh). Example: 1 GW running continuously uses 8.76×10^6 MWh/year (8.76 TWh); at $0.05–0.12/kWh that is roughly $0.44–1.05 billion per year just for power (an actual training run uses only a fraction of that, but peak facility capacity correlates with high power draw during training campaigns).
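The GW × hours × $/kWh arithmetic above can be sketched as follows. The function names are illustrative, and the calculation assumes the facility draws its full rated power all year, which is an upper bound rather than a realistic load profile.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def annual_energy_mwh(facility_gw: float) -> float:
    """Energy for one year of continuous operation at full rated power, in MWh."""
    return facility_gw * 1000.0 * HOURS_PER_YEAR  # GW -> MW, then MW x hours


def annual_power_cost_usd(facility_gw: float, price_per_kwh: float) -> float:
    """Electricity cost in USD for one year at a flat price per kWh."""
    return annual_energy_mwh(facility_gw) * 1000.0 * price_per_kwh  # MWh -> kWh


for gw in (1.0, 1.5):
    low = annual_power_cost_usd(gw, 0.05)
    high = annual_power_cost_usd(gw, 0.12)
    print(f"{gw} GW: {annual_energy_mwh(gw):.3e} MWh/year, "
          f"${low / 1e6:,.0f}M to ${high / 1e6:,.0f}M per year")
```

Scaling the result by the fraction of the year a training campaign actually runs gives a rough per-run figure.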
MoE (DeepSeek):
Lower active FLOPs per token reduce total energy consumed for pretraining; published estimates for large MoE runs imply substantially lower electricity bills for comparable delivered performance. Concrete example: the paper claims training required ~2.6M GPU‑hours, versus 30M+ GPU‑hours for dense models; that gap translates into energy savings roughly proportional to GPU‑hours × per‑GPU power draw.
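To make the "GPU-hours × per-GPU power draw" relation concrete, here is a rough sketch. The per-GPU power (about 700 W) and the PUE of 1.2 are assumptions for illustration, not reported values, and $0.06/kWh is simply an example rate.

```python
# ASSUMPTIONS (not from the cited reports): per-GPU draw, PUE, and price.
GPU_POWER_KW = 0.7     # assumed ~700 W per H100/H800-class GPU under load
PUE = 1.2              # assumed power usage effectiveness (cooling and overhead)
PRICE_PER_KWH = 0.06   # example electricity rate in $/kWh


def training_energy_mwh(gpu_hours: float,
                        gpu_power_kw: float = GPU_POWER_KW,
                        pue: float = PUE) -> float:
    """Facility-level energy for a run: GPU-hours x kW per GPU x PUE, in MWh."""
    return gpu_hours * gpu_power_kw * pue / 1000.0


def training_power_cost_usd(gpu_hours: float,
                            price_per_kwh: float = PRICE_PER_KWH) -> float:
    """Electricity cost in USD for a run under the default assumptions."""
    return training_energy_mwh(gpu_hours) * 1000.0 * price_per_kwh


runs = {"MoE (~2.6M GPU-hours)": 2.6e6, "dense (~30M GPU-hours)": 30e6}
for label, hours in runs.items():
    print(f"{label}: {training_energy_mwh(hours):,.0f} MWh, "
          f"${training_power_cost_usd(hours):,.0f}")
```

Because energy is linear in GPU-hours here, the cost ratio between the two runs equals the GPU-hour ratio regardless of the power, PUE, or price assumed. Differences in hardware generation between the two runs are ignored.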
Inference cost and hardware for deployment
Dense models (Grok):
High VRAM and throughput GPUs (H100/H800) for latency‑sensitive hosted inference; inference energy per token is higher because all parameters are active.
MoE models (DeepSeek):
Lower per‑token activation reduces inference FLOPs and memory traffic; can be cheaper to serve and, with model‑co‑design, can be run on more diverse hardware (including non‑Nvidia accelerators or consumer GPUs with quantization) for cost‑sensitive deployments.
Capital & operational cost tradeoffs
Dense approach:
Higher CapEx on uniform high‑end Nvidia GPUs, NVSwitch/NVLink networking, and larger datacenter power/cooling; simpler software stack for dense training and standard parallelism.
MoE approach:
Potentially lower compute and energy costs per performance unit but higher software complexity (routing, load balancing), more sensitive communication patterns, and potential need for co‑design of hardware/topology to maximize efficiency.
Caveats and uncertainty
Public numbers vary; some figures are from vendor/industry reports and preprints (DeepSeek paper excerpts) and unconfirmed press reporting for xAI/Colossus. Exact GPU‑hour totals, power footprints, and pricing are often proprietary.
MoE savings depend on gating efficiency, routing overhead, and how many experts are actually active per token; communication overhead can erode benefits if poorly implemented.
Regional electricity prices, datacenter PUE, and ownership vs cloud‑rental change $ estimates substantially.
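To illustrate the caveat about how many experts are active per token, here is a toy model. Every number in it is hypothetical and does not describe DeepSeek or any real model; the overhead term is a crude stand-in for routing and communication cost.

```python
def active_fraction(shared_params: float, params_per_expert: float,
                    n_experts: int, top_k: int) -> float:
    """Fraction of all parameters used per token when top_k experts are routed."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return active / total


def effective_saving(fraction: float, overhead: float) -> float:
    """Toy compute saving vs. a dense model of equal total size.

    `overhead` is the extra cost of routing/communication, expressed as a
    fraction of the active compute (0.0 means perfectly free routing).
    """
    return 1.0 / (fraction * (1.0 + overhead))


# HYPOTHETICAL configuration, for illustration only.
frac = active_fraction(shared_params=10e9, params_per_expert=2e9,
                       n_experts=64, top_k=4)
print(f"active fraction: {frac:.3f}")
for overhead in (0.0, 0.25, 0.5):
    print(f"overhead {overhead:.2f}: ~{effective_saving(frac, overhead):.1f}x saving")
```

In this toy model the saving shrinks as overhead grows, which is the sense in which poor routing or communication can erode the MoE advantage.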
If you want, I can produce a compact table with estimated GPU‑hours, per‑GPU power draws, and rough $ electricity costs for a few concrete training scenarios (assume H100/H800 specs and $0.06/kWh), using the numbers above.
So, in essence, DeepSeek, by necessity, tries to lower the electricity bill while yielding the same type of result.
Got it. Yes, there are a lot of optimisations possible with the parameters. Some of them can affect the quality of the model, and quantifying and reducing those effects is a big part of the research. The compute and energy requirements would scale down, but still only linearly.
Here is my prediction - we will see a completely different base paradigm for training these models. Like valves vs transistors. When this happens, we will see an order of magnitude reduction in compute+energy usage and we will probably see multiple iterations of this.
This is what makes this timeline so amazing. For those of us who were of age at the infancy of computers/internet etc, to be able to see another epoch - even bigger than that - and be able to contribute is incredible.
Totally agree.
And true. Watching this kind of transformation is effectively seeing science fiction come into being.