Now do you guys realise why datacenters and cheap energy are so crucial for the future?
More computational power with less energy cost = more token processing power.
He who can compute more can create the most powerful LLMs and solve the hardest problems.
We are at the point in LLM evolution that corresponds to when we could barely fit 16KB of RAM in a PC. Today we can fit close to a TB of RAM in a PC. That's a huge slope upwards, and it took 40 years to get there.
With LLMs the compute power will grow exponentially if Trump's plan works and will usher in centuries of prosperity and freedom.
Summary comparison — compute, GPUs, and energy (assumptions: public reporting, 2024–2026 hardware)
Key assumptions used: Grok (xAI) trains on large Nvidia H100/H800-based Colossus clusters (dense training at multi-petaflop scale); DeepSeek (Whale Lab) uses Mixture-of-Experts (MoE) designs, trained on Nvidia H800 GPUs per its own technical report, with non‑Nvidia accelerators (Huawei Ascend / Cambricon) appearing in some reports. Numbers below are order‑of‑magnitude estimates synthesized from published technical notes, reporting, and community analyses.
Raw compute (training)
Grok / Colossus-style dense training:
Training uses dense model compute, where every parameter contributes to every token. A frontier dense model in the 100B–6T class typically consumes tens to hundreds of PFLOP‑years of total compute (sustained PFLOP/s × years). Example scale: several million to tens of millions of GPU‑hours on H100-class GPUs for the largest training runs.
DeepSeek / MoE:
MoE greatly reduces FLOPs per token because only a subset of experts is activated for each token. Reported: DeepSeek‑V3 ~250 GFLOPs/token vs 2448 GFLOPs/token for a 405B dense model (paper claim; these are operation counts per token, not a rate per second). Reported GPU‑hour totals for DeepSeek‑V3 training are roughly an order of magnitude lower than comparable dense runs (papers/reporting cite low single‑digit million GPU‑hours vs tens of millions for some dense baselines).
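To make the per-token figures above concrete, here is a minimal sketch that scales them to a whole training run. The token budget is a hypothetical round number picked only for illustration, not a reported figure, and the function name is made up for this example.

```python
# Per-token training compute as quoted in the post
# (1 GFLOP = 1e9 floating-point operations).
MOE_GFLOPS_PER_TOKEN = 250.0     # DeepSeek-V3, as quoted above
DENSE_GFLOPS_PER_TOKEN = 2448.0  # 405B dense model, as quoted above

# ASSUMPTION: hypothetical token budget, used only to make the arithmetic concrete.
ASSUMED_TOKENS = 1e13


def total_training_flops(gflops_per_token: float, tokens: float) -> float:
    """Total training FLOPs = per-token FLOPs x number of training tokens."""
    return gflops_per_token * 1e9 * tokens


moe_flops = total_training_flops(MOE_GFLOPS_PER_TOKEN, ASSUMED_TOKENS)
dense_flops = total_training_flops(DENSE_GFLOPS_PER_TOKEN, ASSUMED_TOKENS)
ratio = dense_flops / moe_flops

print(f"MoE:   {moe_flops:.2e} FLOPs")
print(f"Dense: {dense_flops:.2e} FLOPs")
print(f"Dense/MoE ratio: {ratio:.1f}x")
```

The ratio does not depend on the assumed token count, since it cancels out; only the absolute totals do.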
Types of GPUs / accelerators
Grok / xAI:
Heavily Nvidia (H100/H800 family) with NVLink / NVSwitch for intra‑node high bandwidth. GPUs optimized for dense tensor compute and large memory bandwidth.
DeepSeek:
Uses MoE-friendly deployments; reported use of Huawei Ascend-family and H800-style accelerators in some deployments. MoE benefits from high interconnect but can be optimized to reduce IB traffic (node‑limited routing); can also run on mixed hardware including lower‑cost consumer GPUs for inference with proper engine/quantization.
GPU counts and cluster design
Dense (Grok) clusters:
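Very large single‑site clusters (reports of 1–1.5 GW datacenter power footprints for Colossus‑class installs). At a rough 1 kW or more per GPU once host, networking, and cooling overhead are included, a footprint that size implies hundreds of thousands of H100‑class GPUs at full build‑out, for frontier training and large on‑demand inference capacity.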
MoE (DeepSeek) clusters:
Fewer effective GPU hours required for equivalent capability; MoE still requires many GPUs for parameter storage and routing at scale but can hit similar performance with fewer active FLOPs and specialized routing to reduce cross‑node bandwidth. Reports estimate training DeepSeek‑V3 required a few million GPU‑hours on H800‑class gear (much lower than some dense baselines).
Electricity and power costs (training)
Dense (Grok):
If a Colossus facility is 1–1.5 GW peak, annual electricity for continuous operation is enormous (GW × hours × $/kWh). Example: 1 GW running continuously uses 8.76×10^6 MWh/year; at $0.05–0.12/kWh that is roughly $440 million to $1.05 billion per year just for power (actual training uses a fraction of continuous peak, but peak facility capacity correlates with high power draw during training campaigns).
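The GW × hours × $/kWh arithmetic can be checked with a short sketch. The function names and the `utilization` parameter are illustrative assumptions; the 1 GW size and the price range come from the paragraph above.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_energy_mwh(facility_gw: float, utilization: float = 1.0) -> float:
    """Energy used in one year, in MWh, for a facility of the given size.

    utilization is the assumed average fraction of peak power actually drawn.
    """
    return facility_gw * 1000.0 * HOURS_PER_YEAR * utilization


def annual_cost_usd(energy_mwh: float, price_per_kwh: float) -> float:
    """Electricity cost in dollars (1 MWh = 1000 kWh)."""
    return energy_mwh * 1000.0 * price_per_kwh


energy = annual_energy_mwh(1.0)            # 1 GW, running continuously
low_cost = annual_cost_usd(energy, 0.05)   # cheap power
high_cost = annual_cost_usd(energy, 0.12)  # expensive power

print(f"Energy: {energy:.3g} MWh/year")
print(f"Cost:   ${low_cost / 1e6:.0f}M to ${high_cost / 1e6:.0f}M per year")
```

At these prices, continuous operation at 1 GW lands between about $0.44 billion and $1.05 billion per year; passing a lower `utilization` scales the result down proportionally.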
MoE (DeepSeek):
Lower active FLOPs per token reduce total energy consumed for pretraining; published estimates for large MoE runs imply substantially lower electricity bills for comparable delivered performance. Concrete example: paper claims training requiring ~2.6M GPU‑hours vs dense models requiring 30M+ GPU‑hours — that gap multiplies into energy savings roughly proportional to GPU‑hours × per‑GPU power draw.
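As a rough sketch of the "GPU‑hours × per‑GPU power draw" estimate above: the 0.7 kW per GPU and the PUE of 1.2 are assumptions chosen for illustration, not reported values for either run, and $0.06/kWh is the price the post itself proposes. The function names are hypothetical.

```python
# ASSUMPTIONS (illustrative, not reported values for either training run):
ASSUMED_GPU_KW = 0.7      # per-GPU draw, about an H100-class SXM board at full load
ASSUMED_PUE = 1.2         # datacenter overhead multiplier (cooling, power delivery)
PRICE_PER_KWH = 0.06      # $/kWh, the figure suggested in the post


def training_energy_kwh(gpu_hours: float, gpu_kw: float, pue: float = 1.0) -> float:
    """Facility energy for a run: GPU-hours x kW per GPU x PUE."""
    return gpu_hours * gpu_kw * pue


def electricity_cost_usd(energy_kwh: float, price_per_kwh: float) -> float:
    """Electricity bill in dollars for the given energy."""
    return energy_kwh * price_per_kwh


# GPU-hour figures as quoted above.
runs = {"MoE run (~2.6M GPU-hours)": 2.6e6, "Dense run (30M GPU-hours)": 30e6}

for label, gpu_hours in runs.items():
    kwh = training_energy_kwh(gpu_hours, ASSUMED_GPU_KW, ASSUMED_PUE)
    cost = electricity_cost_usd(kwh, PRICE_PER_KWH)
    print(f"{label}: {kwh / 1e6:.2f} GWh, about ${cost:,.0f}")
```

Because the same power draw and PUE are assumed for both runs, the energy ratio is just the GPU‑hour ratio (about 11.5×). Under these assumptions the electricity bill for a single run is in the hundreds of thousands to low millions of dollars, which is small next to the hardware cost.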
Inference cost and hardware for deployment
Dense models (Grok):
High‑VRAM, high‑throughput GPUs (H100/H800) for latency‑sensitive hosted inference; inference energy per token is higher because all parameters are active.
MoE models (DeepSeek):
Lower per‑token activation reduces inference FLOPs and memory traffic; can be cheaper to serve and, with model‑co‑design, can be run on more diverse hardware (including non‑Nvidia accelerators or consumer GPUs with quantization) for cost‑sensitive deployments.
Capital & operational cost tradeoffs
Dense approach:
Higher CapEx on uniform high‑end Nvidia GPUs, NVSwitch/NVLink networking, and larger datacenter power/cooling; simpler software stack for dense training and standard parallelism.
MoE approach:
Potentially lower compute and energy costs per performance unit but higher software complexity (routing, load balancing), more sensitive communication patterns, and potential need for co‑design of hardware/topology to maximize efficiency.
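For readers unfamiliar with the routing and load-balancing issue, here is a toy sketch of top-k expert routing. It is not DeepSeek's router: random scores stand in for a learned gating network, and the expert count, k, and function names are all hypothetical choices for illustration.

```python
import random
from collections import Counter


def route_top_k(scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def simulate_routing(num_tokens: int, num_experts: int, k: int, seed: int = 0) -> Counter:
    """Count how many tokens land on each expert.

    ASSUMPTION: uniform random scores replace a trained gating network.
    """
    rng = random.Random(seed)
    load: Counter = Counter()
    for _ in range(num_tokens):
        scores = [rng.random() for _ in range(num_experts)]
        for expert in route_top_k(scores, k):
            load[expert] += 1
    return load


NUM_TOKENS, NUM_EXPERTS, TOP_K = 1000, 16, 2  # hypothetical sizes
load = simulate_routing(NUM_TOKENS, NUM_EXPERTS, TOP_K)

# Fraction of expert parameters touched per token.
active_fraction = TOP_K / NUM_EXPERTS
# Busiest expert's load relative to a perfectly even split (1.0 = balanced).
imbalance = max(load.values()) / (NUM_TOKENS * TOP_K / NUM_EXPERTS)

print(f"Active experts per token: {TOP_K}/{NUM_EXPERTS} = {active_fraction:.1%}")
print(f"Load imbalance (max / ideal): {imbalance:.2f}")
```

The active fraction is where the FLOP savings come from; the imbalance figure is where the software complexity comes from, since the busiest expert sets the pace for a step unless the router is trained or constrained to spread tokens evenly.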
Caveats and uncertainty
Public numbers vary; some figures are from vendor/industry reports and preprints (DeepSeek paper excerpts) and unconfirmed press reporting for xAI/Colossus. Exact GPU‑hour totals, power footprints, and pricing are often proprietary.
MoE savings depend on gating efficiency, routing overhead, and how many experts are actually active per token; communication overhead can erode benefits if poorly implemented.
Regional electricity prices, datacenter PUE, and ownership vs cloud‑rental change $ estimates substantially.
If you want, I can produce a compact table with estimated GPU‑hours, per‑GPU power draws, and rough $ electricity costs for a few concrete training scenarios (assume H100/H800 specs and $0.06/kWh), using the numbers above.
So, in essence, Deepseek, by necessity, tries to lower the electricity bill while yielding the same type of result.
From my perspective, it seems more that, with LLMs, the REQUIREMENT for computing power will grow exponentially, owing to the geometric growth of the network interconnections. Somebody needs to arrive at a cost-effectiveness metric for a process that may consume $ billions and produce intellectual morons.
I've seen some amazing hallucinations in AI-produced videos. I wouldn't want to have an AI-driven robot performing brain surgery on me. (Or MCAS flying an airplane, but that's another story.)
Hopefully this race to the top makes practical embedded NPUs viable. Ever been stuck at a red light for a while even though no cars are coming? Having an embedded NPU would enable you to not connect the camera to the cloud. Cheaper and more reliable infrastructure, more privacy, and less congestion to boot.
It's not just Claude.
Interesting then the direction Deepseek was going in.
BTW: I kinda like GAB as a meta AI. But indeed, Grok rules... on certain matters.
Elaborate?
I highly predict that this is where things will head. Personal cloud.
I actually have a lot of my stuff including photos sync straight to my home pc via pangolin
Let's just see what happens here. I do like Grok.