
TPU v8 vs Blackwell: How AI Silicon Is Splitting Into Training and Inference Chips

Google's TPU v8t/v8i is the cleanest statement yet of a structural shift: training and inference want different silicon. Here's the architecture, the tradeoffs against NVIDIA Blackwell's unified approach, and what it means for what you actually pay per token.

S5 Labs Team May 16, 2026

For about ten years, “AI accelerator” meant one category of chip. More FLOPS, more memory bandwidth, ship it to anyone who needs to train or serve a model. NVIDIA built the dominant ecosystem on that thesis — A100, H100, and now Blackwell — and the buying decision collapsed to “how many can I get and when.”

Google’s eighth-generation TPU breaks that pattern in the open. The lineup is no longer a single chip. There is TPU v8t for training and TPU v8i for inference, with different topologies, different memory hierarchies, different perf-per-dollar curves, and different SKUs. It is the cleanest public statement yet of a bet the industry has been quietly making for two years: that training and inference are different workloads with different bottlenecks, and that specialized silicon at each end wins on total cost of ownership.

This article walks through the architecture of both v8 SKUs, the math of where the workloads actually diverge, and what that means for a buyer choosing between a GB200 NVL72 rack and a TPU v8i pod. At frontier scale, the decision is rarely about peak FLOPS. It is about memory locality, interconnect topology, and the engineering bill to migrate your stack.

Why training and inference are different workloads

Before naming chips, it helps to name the bottlenecks. The reason a single architecture optimized for both is a compromise is that the workloads disagree about almost everything that matters for silicon design.

Training is a bandwidth and goodput problem

A pretraining run is, at the chip level, a steady-state loop: forward pass over a batch, backward pass that produces gradients for every parameter, an all-reduce of those gradients across every accelerator in the cluster, and an optimizer step that updates weights. Repeat for trillions of tokens.
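
The loop above can be sketched in a few lines of plain Python. This is a toy, not a real training stack: the one-parameter model, the `all_reduce_mean` helper, and the four simulated workers are illustrative assumptions, and plain SGD stands in for the optimizer.

```python
# Toy data-parallel training loop. Each simulated worker holds a data shard
# and its own replica of a single weight; no real accelerators are involved.

def local_gradient(w, shard):
    """Forward and backward pass for y = w * x with mean squared error."""
    grad = 0.0
    for x, y in shard:
        pred = w * x                   # forward pass
        grad += 2.0 * (pred - y) * x   # backward pass: d/dw of (pred - y)^2
    return grad / len(shard)

def all_reduce_mean(values):
    """Stand-in for the collective: every worker receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def train(shards, steps=200, lr=0.01):
    weights = [0.0] * len(shards)      # one weight replica per worker
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = all_reduce_mean(grads)                 # gradient synchronization
        weights = [w - lr * g for w, g in zip(weights, grads)]  # optimizer step (SGD)
    return weights

# Four workers, each holding a slice of data drawn from y = 3x.
shards = [[(x, 3.0 * x) for x in range(i + 1, i + 4)] for i in range(4)]
final = train(shards)
```

Every replica ends the run with the same weight because each step applies the same averaged gradient. On real hardware the `all_reduce_mean` call is the part that crosses the interconnect.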

Two things dominate. First, state size. Adam-family optimizers (Adam, AdamW) hold fp32 momentum and variance buffers for every parameter, usually alongside an fp32 master copy of the weights and gradient accumulators; Lion keeps only the momentum buffer. A 1.6T-parameter MoE doesn’t just need 1.6T × 2 bytes (3.2 TB) of weight memory; the resident training state can be six to eight times that, roughly 19 to 26 TB before activations. Once activations and working buffers for large batches are added, the realistic ask is far larger again, and it has to be addressable as a single pool. Second, collective communication. After every step, every chip has to share its gradient slice with every other chip. The interconnect topology (bandwidth, latency, fault tolerance) sets the ceiling on how fast the cluster can iterate.
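
As a rough sizing sketch, the resident state can be tallied per parameter. The byte widths below are assumptions for one common mixed-precision recipe (bf16 weights and gradients, an fp32 master copy, two fp32 optimizer buffers); other recipes will give different totals, and activations are excluded.

```python
def training_state_bytes(params, weight_bytes=2, master_bytes=4,
                         optimizer_slots=2, slot_bytes=4, grad_bytes=2):
    """Resident training state in bytes, excluding activations.

    Defaults are an assumed recipe: bf16 weights, fp32 master weights,
    two fp32 optimizer buffers (momentum and variance), bf16 gradients.
    """
    state = {
        "weights": params * weight_bytes,
        "master": params * master_bytes,
        "optimizer": params * optimizer_slots * slot_bytes,
        "grads": params * grad_bytes,
    }
    state["total"] = sum(state.values())
    return state

PARAMS = 1_600_000_000_000              # the 1.6T-parameter example
state = training_state_bytes(PARAMS)
ratio = state["total"] / state["weights"]   # multiple of the bf16 weight footprint
total_tb = state["total"] / 1e12            # decimal terabytes
```

Under these assumptions the total is 16 bytes per parameter, eight times the bf16 weights alone. Setting `optimizer_slots=1` models a single-buffer optimizer such as Lion.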

Then there’s the operational tax: goodput, the fraction of wall-clock time the cluster is actually producing useful gradient updates instead of waiting on a failed link, a slow node, or a checkpoint reload. On a 10,000-chip cluster, one percentage point of goodput is the difference between a 30-day run and a 30.3-day run. That extra 0.3 days is roughly 72,000 chip-hours, which is a six-figure bill at single-digit-dollar chip-hour rates for every point of goodput lost.
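
The goodput arithmetic is easy to reproduce. In this sketch the chip-hour rate is a deliberately hypothetical input, not a quoted price; the 30-day and 10,000-chip figures come from the paragraph above.

```python
def wall_clock_days(ideal_days, goodput):
    """Wall-clock duration of a run that needs ideal_days of useful work."""
    return ideal_days / goodput

def extra_chip_hours(ideal_days, goodput_high, goodput_low, chips):
    """Additional chip-hours consumed when goodput drops from high to low."""
    delta_days = (wall_clock_days(ideal_days, goodput_low)
                  - wall_clock_days(ideal_days, goodput_high))
    return delta_days * 24 * chips

slow_run_days = wall_clock_days(30, 0.99)
penalty_hours = extra_chip_hours(30, 1.00, 0.99, chips=10_000)

HYPOTHETICAL_USD_PER_CHIP_HOUR = 5.0    # assumption for illustration only
penalty_usd = penalty_hours * HYPOTHETICAL_USD_PER_CHIP_HOUR
```

Swap in your own contract rate for the hypothetical constant; the chip-hour count does not depend on it.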

Notice what is not on this list: per-chip peak FLOPS. Peak compute matters, but it is one constraint among several, and at frontier scale it is rarely the binding one.

Inference is a memory-locality and latency problem

Inference looks completely different at the silicon level. You are running one forward pass per generated token, against a model whose weights are already resident, with a single user (or small batch) waiting on the result. There is no gradient. There is no all-reduce per token. There is, however, the KV cache — the running record of every key and value tensor produced by every prior token, which the next token has to attend over.

The KV cache is the inference equivalent of optimizer state, except worse: it scales linearly in context length, linearly in batch size, and it is per-user, not shared. A 70B-class model serving ten concurrent users with 100K-token contexts can spend more memory on cache than on weights. That memory has to be read on every decode step, because the new query needs to attend back to every prior position. The metric that determines whether your serving stack pencils out is not peak FLOPS — it is how much of the active cache fits in fast memory, and how fast you can move the rest.
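
A back-of-envelope sketch makes the cache-versus-weights claim concrete. The model shape below (80 layers, 8 key-value heads under grouped-query attention, head dimension 128, 16-bit cache entries) is an assumed 70B-class configuration chosen for illustration, not a specific product's spec.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, users,
                   bytes_per_value=2):
    """Total KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * users

# Assumed 70B-class shape with grouped-query attention (illustrative only).
cache = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                       context_tokens=100_000, users=10)
weights = 70_000_000_000 * 2            # 70B parameters stored in 16 bits

cache_gb = cache / 1e9
weights_gb = weights / 1e9
```

With these assumptions the cache comes to about 328 GB against 140 GB of weights. Without grouped-query attention (more key-value heads) the gap widens further.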

This is why SRAM has become the headline number for inference silicon. SRAM is the on-die memory that sits between registers and HBM. It is small (hundreds of megabytes, not gigabytes), expensive per byte, and roughly an order of magnitude faster than HBM. A chip with more SRAM can hold more of the working cache locally, which translates directly into fewer trips to HBM, lower per-token latency, and higher throughput on long contexts.

One workload wants ocean-scale memory pooling and brutal interconnect bandwidth. The other wants a small, fast working set held close to compute and the lowest possible per-token latency. A single chip optimized for both is, by definition, a compromise. That is the conceptual claim the v8 split makes explicit.

The TPU roadmap has been moving here for years

The v8 split is not a surprise pivot. It is the culmination of a trajectory Google has been walking since its first TPU in 2015.

  • TPU v1 (2015) was inference-only — a domain-specific accelerator for Google Search and Translate.
  • v2/v3 added training capability and bf16 math.
  • v4 (2022) introduced the optically-switched supercomputer topology described in Google’s v4 paper, and started the formal split: there was a flagship and a smaller variant.
  • v5 went further: v5p was the frontier training part; v5e was a cost-optimized variant pitched at serving and fine-tuning workloads.
  • Trillium (v6) continued the dual track.
  • Ironwood (v7) was Google’s first explicit inference-first generation, with on-chip SRAM and topology optimizations targeted at serving.
  • v8 drops the pretense and ships two distinct SKUs with different topologies, different memory hierarchies, and different software-stack tuning.

The reframe — and this is the move that matters editorially — is that v8 is not Google reacting to Blackwell. It is Google committing publicly to a thesis it has been building toward for four generations: AI silicon is no longer one category of chip.

TPU v8t: Google’s training thesis in silicon

Google’s specifications for v8t, announced at Cloud Next ’26, foreground three numbers.

Scale. A v8t superpod connects 9,600 TPUs through a new Inter-Chip Interconnect topology. That is a step-change in coherent training fabric size relative to Ironwood. Crucially, the spec talks about the pod as a unit, not the chip: the design is rack-and-pod-first, with the per-chip number treated as an implementation detail rather than the headline spec.

Pooled memory. The pod exposes 2 PB of shared high-bandwidth memory addressable across all 9,600 chips. This is the line that matters for trillion-parameter MoE training, where the binding constraint is fitting optimizer state plus expert weights in a coherent address space without sharding gymnastics. Two petabytes of pooled HBM is enough to hold a 1.6T MoE with full optimizer state and gradient accumulators, plus a healthy amount of activation memory, without spilling.
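
To put the pool size in proportion, here is a small arithmetic sketch. It assumes decimal petabytes and a 16-bytes-per-parameter training state (bf16 weights and gradients, fp32 master copy, two fp32 optimizer buffers); both are assumptions, and activation memory is left out.

```python
POD_HBM_BYTES = 2 * 10**15          # 2 PB pooled HBM, read as decimal petabytes
POD_CHIPS = 9_600                   # chips per superpod, as quoted
ASSUMED_BYTES_PER_PARAM = 16        # assumption: mixed-precision Adam-style state
PARAMS = 1_600_000_000_000          # the 1.6T MoE example

per_chip_gb = POD_HBM_BYTES / POD_CHIPS / 1e9        # each chip's share of the pool
state_bytes = PARAMS * ASSUMED_BYTES_PER_PARAM       # weights + optimizer + grads
state_share = state_bytes / POD_HBM_BYTES            # fraction of the pool used
```

On these assumptions each chip contributes a little over 200 GB, and the parameter-related state occupies only a small slice of the pool, which leaves the bulk for activations and working buffers.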

Generation-over-generation gain. Google claims 3× compute throughput vs. Ironwood (v7) and up to 2× perf-per-watt. Both numbers come with the same caveat that applies to every silicon launch: they are vendor-selected comparisons to the vendor’s own prior generation, on workloads the vendor picked. NVIDIA makes equivalent generation-over-generation claims for Blackwell vs. Hopper. Treat them as plausible engineering deltas, not as cross-vendor benchmarks.

The architectural commitment behind v8t is that frontier-scale training is a memory and interconnect problem, not a compute problem. A 1.6T MoE with 32 active experts at any time wants its expert weights addressable without explicit sharding gymnastics; it wants its gradient buffers in a coherent pool; and it wants the all-reduce at every step to finish before the next forward pass starts. v8t is engineered around exactly that loop.
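
Whether the all-reduce finishes in time can be bounded with the textbook ring all-reduce cost model. This sketch covers only the bandwidth term, and the payload size and per-chip link bandwidth are hypothetical inputs; production fabrics use topology-aware algorithms and also pay a latency term that grows with chip count.

```python
def ring_allreduce_seconds(payload_bytes, chips, link_bytes_per_s):
    """Bandwidth term of a ring all-reduce.

    Each chip sends and receives 2 * (N - 1) / N of the payload in total:
    one pass for reduce-scatter, one pass for all-gather.
    """
    if chips < 2:
        return 0.0
    return 2 * (chips - 1) / chips * payload_bytes / link_bytes_per_s

# Hypothetical inputs: a 10 GB gradient buffer and 100 GB/s per chip.
step_sync = ring_allreduce_seconds(10e9, chips=9_600, link_bytes_per_s=100e9)
```

The estimate barely changes with chip count once N is large, which is why per-link bandwidth, not cluster size, dominates this term.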

The goodput angle is the part Google has been quietest about and that matters most for buyers. Cluster reliability at 10,000 accelerators is a different engineering problem than at 1,000 — link failures, host outages, partial-pod degradations all happen often enough to matter, and the software stack has to route around them without restarting the run. Google’s history with v4’s optical reconfigurability suggests they have spent more engineering on this than they typically advertise.

TPU v8i: Google’s inference thesis in silicon

If v8t is the obvious chip — the one that any vendor with a frontier-scale customer would have to build — v8i is the interesting one.

Topology. v8i uses a new fabric Google calls Boardfly, which directly connects 1,152 TPUs in a single pod with denser local connectivity than the ICI topology v8t uses. The design choice here mirrors how serving workloads actually distribute: many small replicas, each handling a slice of traffic, with model parallelism inside a replica and data parallelism across them. Boardfly’s local density supports replica-level coherent serving without paying for cross-pod global topology you don’t need.

On-chip SRAM. This is the headline. v8i ships with roughly 3× more on-chip SRAM than the prior generation. The motivation is the KV cache. More SRAM means more of the active cache lives one bus hop from compute, which compounds into latency wins on every decoded token. The arithmetic is direct: for long-context serving with grouped-query attention, the per-token KV bandwidth is the binding constraint, and SRAM gives you an order of magnitude more bandwidth than HBM at lower energy per byte moved.
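
A simple lower bound shows why the SRAM share matters. The sketch below assumes a decode step must stream the whole cache once, and the bandwidth figures are round hypothetical numbers that only encode the "order of magnitude" gap; the SRAM fraction is a free parameter, not a v8i specification.

```python
def decode_read_seconds(cache_bytes, sram_fraction,
                        hbm_bytes_per_s, sram_bytes_per_s):
    """Lower bound on the time to stream the KV cache for one decode step."""
    if not 0.0 <= sram_fraction <= 1.0:
        raise ValueError("sram_fraction must be between 0 and 1")
    sram_bytes = cache_bytes * sram_fraction
    hbm_bytes = cache_bytes - sram_bytes
    return sram_bytes / sram_bytes_per_s + hbm_bytes / hbm_bytes_per_s

CACHE_BYTES = 32_768_000_000   # hypothetical: one long-context request
HBM_BW = 8e12                  # hypothetical HBM bandwidth, bytes per second
SRAM_BW = 80e12                # hypothetical SRAM bandwidth, ten times HBM

all_hbm = decode_read_seconds(CACHE_BYTES, 0.0, HBM_BW, SRAM_BW)
quarter_sram = decode_read_seconds(CACHE_BYTES, 0.25, HBM_BW, SRAM_BW)
```

Moving a quarter of the reads to the faster tier cuts the bound by a bit under a quarter, so the benefit scales almost linearly with how much of the hot cache fits on-die.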

Collectives Acceleration Engine. Dedicated hardware for the all-reduce, all-gather, and reduce-scatter patterns that show up inside attention and inside tensor-parallel inference. The CAE is a small, specific piece of silicon, and the point of including it is that those patterns previously consumed a non-trivial fraction of inference cycle time on general-purpose accelerators.
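
For readers less familiar with these patterns, the relationship between them fits in a few lines: an all-reduce can be built from a reduce-scatter followed by an all-gather. The pure-Python version below simulates workers as lists and says nothing about how the CAE implements them in hardware.

```python
def reduce_scatter(buffers):
    """Worker i ends up owning the element-wise sum of chunk i.

    Assumes every buffer has the same length, divisible by the worker count.
    """
    n = len(buffers)
    chunk = len(buffers[0]) // n
    return [
        [sum(buf[i * chunk + j] for buf in buffers) for j in range(chunk)]
        for i in range(n)
    ]

def all_gather(chunks):
    """Every worker receives all chunks, concatenated in rank order."""
    full = [value for chunk in chunks for value in chunk]
    return [list(full) for _ in chunks]

def all_reduce(buffers):
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(buffers))

buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]   # two simulated workers
result = all_reduce(buffers)                 # both hold [11, 22, 33, 44]
```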

Cost claim. Google reports 80% better perf-per-dollar for inference versus the prior generation. Same caveat as before: this is vs. Ironwood, not vs. Blackwell, and the comparison workloads are Google’s choice. But the underlying thesis is defensible: at high-concurrency long-context serving, more SRAM and tighter collectives beat more raw FLOPS, because the binding constraint is memory movement, not arithmetic.

The thing v8i lets Google say to buyers that Blackwell can’t is: you are not paying for training-grade interconnect or training-grade memory pools you don’t use during serving. Whether that translates into a cheaper cost per million tokens at your specific workload depends on the workload — but the architectural argument holds up.

Blackwell: the best case for a unified AI stack

NVIDIA’s bet is the opposite. Blackwell is one silicon family, sold in many configurations, with the software stack adapting to the workload.

The chip. B100 and B200 are single-package parts, each built from two GPU dies joined by a high-bandwidth die-to-die link and presented to software as one GPU. B200 ships with approximately 192 GB HBM3e and roughly 8 TB/s of memory bandwidth per package, with native support for FP4 and FP8 math. Verify exact numbers against NVIDIA’s current Blackwell documentation; these specs evolve with revisions.

The package. A GB200 Superchip pairs two B200s with a Grace CPU on a single board, coherent over NVLink-C2C.

The rack. GB200 NVL72 connects 36 GB200 Superchips (72 Blackwell GPUs and 36 Grace CPUs) into a single NVLink domain via NVSwitch, yielding roughly 30 TB of fast memory across GPU HBM and Grace-attached LPDDR, and on the order of 130 TB/s of in-rack NVLink bandwidth. NVL72 is the unit of comparison against a TPU pod, not a single chip.

The architectural philosophy is straightforward: take the same silicon and let it serve every workload. For training, you allocate more racks. For inference, you allocate fewer. The software stack (CUDA, cuDNN, TensorRT-LLM) abstracts the difference. If you can get a workload running on Blackwell at all, you can usually scale it from one GPU to NVL72 without rewriting the model.

Two things sell Blackwell that have nothing to do with peak FLOPS.

NVLink and NVSwitch. Inside an NVL72, the 72 GPUs behave functionally like a single accelerator with unified memory. That matters for serving large models that don’t fit on a single chip — model parallelism across the rack is fast enough that the latency penalty is bearable. It also matters for training, where the all-reduce of gradients across 72 GPUs benefits from the same fabric. The same physical rack serves both.

CUDA. This is the unromantic answer, and it is also the dominant one. Every PyTorch library, every research codebase, every internal pipeline at every customer that has trained a model in the last six years runs on CUDA. The cost of switching off CUDA is not the cost of one engineer; it is the cost of every dependency in your stack having been written for CUDA. JAX, OpenAI’s Triton, and the kernel libraries built on it are eroding this slowly, but in 2026 it is still the deciding factor for most enterprise buyers.

That, structurally, is NVIDIA’s bet. Same silicon, broader software, lower SKU complexity, the assumption that workload heterogeneity is a software problem.

The architectural philosophy contrast

The two bets are coherent. They are not the same bet.

| Dimension | TPU v8 | Blackwell |
| --- | --- | --- |
| Workload assumption | Training and inference want different silicon | Same silicon, software adapts |
| SKU strategy | Two distinct parts (v8t, v8i) | One family, many configurations |
| Memory bet | Pooled HBM at training; SRAM at inference | Large per-chip HBM; rack-level memory pool via NVLink |
| Interconnect | ICI at training; Boardfly at inference | NVLink + NVSwitch for both |
| Software stack | XLA + JAX + PyTorch-XLA + vLLM | CUDA + cuDNN + TensorRT-LLM (broad PyTorch native) |
| Distribution | GCP-first; tighter cloud coupling | Every hyperscaler, every on-prem, every research lab |
| Vendor philosophy | Vertical integration: silicon → compiler → framework | Horizontal dominance: silicon + ecosystem |

The honest read on this table is not that one approach is winning. It is that the workload regime is wider than a single architecture can serve gracefully, and the two vendors are now disagreeing in the open about where the design points should be.

What the math actually says about cost

The hardest question — and the one buyers actually need answered — is what a million tokens costs on each.

Google publishes perf-per-dollar claims for v8i. NVIDIA publishes throughput claims for Blackwell. Neither publishes the inputs you need to do an apples-to-apples comparison: workload mix, batch size, context length, model shape, achieved goodput, and amortized rack-level capex. Independent benchmarks of v8 vs. Blackwell on standardized workloads — the kind MLPerf Inference is supposed to produce — are not yet public.
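
Once you have measured those inputs yourself, the cost formula is short. The sketch below is a simplification under stated assumptions: a flat chip-hour price, a measured aggregate throughput, and a utilization factor for idle capacity. Every number in the example is hypothetical.

```python
def cost_per_million_tokens(chip_hour_usd, chips, tokens_per_second,
                            utilization=1.0):
    """Serving cost per million generated tokens for one replica.

    tokens_per_second is the measured aggregate throughput of the replica;
    utilization discounts it for the share of time the replica sits idle.
    """
    if tokens_per_second <= 0 or utilization <= 0:
        raise ValueError("throughput and utilization must be positive")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return chip_hour_usd * chips / tokens_per_hour * 1_000_000

# Hypothetical replica: 8 chips at $4 per chip-hour, 20,000 tokens/s, 60% busy.
example = cost_per_million_tokens(4.0, chips=8, tokens_per_second=20_000,
                                  utilization=0.6)
```

The formula is the same on either platform; what differs is the throughput you actually achieve at your batch size and context length, which is exactly the input vendors do not publish.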

What’s actually known:

For frontier training, the choice is dominated by what model you are training. A dense 70B fits comfortably on either platform; you’d pick on price and availability. A 1.6T MoE benefits materially from v8t’s 2 PB pooled memory, because the alternative is more aggressive sharding on NVL72 — workable, but engineering-expensive. If you are Anthropic or DeepMind, the v8t architecture is structurally aligned with your workload.

For enterprise inference, the math is more genuinely contested. v8i’s SRAM and CAE are the right design choices for high-concurrency long-context serving. B200’s memory bandwidth and TensorRT-LLM’s kernel maturity are competitive on the same workload. Public data is not yet detailed enough to call which wins at typical workloads, but they are within striking distance of each other, and the 80% perf-per-dollar claim Google is making is against Ironwood, not against Blackwell.

The lock-in tax is real on both sides. Migrating from CUDA to TPU costs months of engineering for any non-trivial workload — porting custom kernels, rebuilding tooling, retraining ops people on JAX or PyTorch-XLA, qualifying new monitoring. Migrating the other way costs the same. The difference is that NVIDIA’s footprint is broader — on AWS, Azure, GCP, on-prem, and at every research lab — while TPU is GCP-only. For a buyer who values multi-cloud portability, that asymmetry weighs heavily on Blackwell’s side. For a buyer who is already deep in Google Cloud, it weighs the other way.

The lock-in cost is in engineering time, not chip-hour price. This is the line that gets lost in most TCO discussions. The chip-hour rate is a sticker price. The cost of moving a serious workload is engineers, weeks, and the opportunity cost of what those engineers could have shipped instead.
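
One way to weigh that engineering bill against a lower sticker price is a break-even estimate. This is a sketch with hypothetical staffing and spend figures; it ignores opportunity cost, which the paragraph above argues is often the larger term.

```python
def migration_cost_usd(engineers, weeks, loaded_weekly_usd):
    """Direct engineering cost of a migration."""
    return engineers * weeks * loaded_weekly_usd

def breakeven_months(migration_usd, monthly_compute_usd, savings_fraction):
    """Months of compute savings needed to repay the migration."""
    monthly_savings = monthly_compute_usd * savings_fraction
    if monthly_savings <= 0:
        return float("inf")
    return migration_usd / monthly_savings

# Hypothetical: 6 engineers for 16 weeks at a $6,000 loaded weekly cost,
# against $200,000 of monthly compute that would become 20% cheaper.
cost = migration_cost_usd(6, 16, 6_000)
months = breakeven_months(cost, 200_000, 0.20)
```

A payback period longer than the expected life of the hardware generation is a signal that the chip-hour discount alone does not justify the move.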

For context on what the dollar scale of these decisions has become, Meta is spending $145B on AI infrastructure this year and Anthropic’s compute deals with Google and Broadcom run into the hundreds of billions. The frontier customers are not picking on chip-hour price. They are picking on architecture fit and supply.

The specialization wave is bigger than Google vs. NVIDIA

If Google’s v8 split were the only example of workload specialization, it would be a Google strategy. It is not.

AWS Trainium and Inferentia. Amazon shipped the same conceptual split before Google did. Trainium for training, Inferentia for inference. The chips are behind Google’s and NVIDIA’s on raw performance, but the architectural philosophy is the same, and AWS’s distribution is the biggest single channel for non-NVIDIA accelerators.

Groq LPU. Inference-only, deterministic compute, ultra-low time-to-first-token. Wins decisively on latency for short-context workloads; loses on cost per token at high concurrency. The right chip for voice agents and streaming applications; the wrong chip for batch serving.

Cerebras WSE-3. Single-wafer chip, training-focused. A different bet on memory pooling: instead of pooling across many chips, pool everything onto a single 8.5 in × 8.5 in wafer with on-die SRAM in the gigabytes. Niche but real, particularly in scientific computing and certain biology workloads.

SambaNova, Tenstorrent, Etched (Sohu). Specialty silicon for transformer-only inference, MoE-specialized workloads, or memory-disaggregated training. Each picks one design point in the workload space and over-optimizes for it.

The market structure is now visibly fragmenting. NVIDIA holds generality; Google and AWS run the specialization wave at hyperscale price points; a half-dozen startups own niche workloads. The mental model “AI accelerator = NVIDIA” was accurate in 2022 and is no longer accurate in 2026.

What this means for builders

The decision-useful takeaway is not “which chip is best.” It is “which chip fits the workload you actually run.”

For training:

  • If you are not at frontier scale — that is, if you are fine-tuning, training adapters, or pretraining models below the 100B-parameter range — the choice is dominated by your existing software stack. Don’t switch off CUDA without a strong reason. The engineering bill is rarely worth it at this scale.
  • If you are at frontier scale and your binding constraint is fitting state in coherent memory, v8t is a credible option, especially if you are already in GCP. The 2 PB pooled-memory claim is the right design for trillion-parameter MoE training.
  • The TPU compiler stack — XLA, JAX, PyTorch-XLA — has matured significantly. The “TPU means JAX or nothing” framing is no longer accurate, but PyTorch on TPU is still not as ergonomic as PyTorch on Blackwell.

For inference:

  • If your workload is short-context, latency-bound voice or streaming, look at Groq first. Nothing else is in the same ZIP code on time-to-first-token.
  • If your workload is high-concurrency long-context serving — RAG over 100K-token documents, agentic loops over million-token state — v8i’s SRAM advantage is real, and so is Blackwell’s CUDA-native software path. Pilot both before committing.
  • If your workload is general-purpose API serving for a model someone else trained, the no-brainer is Blackwell on whichever cloud you already use. CUDA and TensorRT-LLM are mature, the kernels are heavily optimized, and the supply situation has finally normalized.

For both:

  • The question shifted in 2026 from “can I get an H100?” to “which chip family is right for my workload?” — and the answer is no longer obvious. Build the workload profile first. Pick the silicon second.
  • Don’t be seduced by single-chip peak FLOPS numbers. Look at sustained perf at your batch size, your context length, your concurrency. Vendor benchmarks are vendor benchmarks.
  • Treat the question of cloud portability as a strategic input, not an afterthought. TPU is GCP-only; Blackwell ships everywhere. That asymmetry is sometimes the deciding factor.
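
Measuring sustained performance at your own operating points can start with a small sweep harness like the one below. The `generate` callable is a hypothetical hook you would wire to your serving stack; `fake_generate` is a stand-in so the sketch runs on its own.

```python
import itertools
import time

def sweep(generate, batch_sizes, context_lengths, repeats=3):
    """Measure tokens per second over a grid of batch size and context length.

    generate(batch, context) must run one request batch and return the
    number of tokens it produced.
    """
    results = {}
    for batch, context in itertools.product(batch_sizes, context_lengths):
        total_tokens, total_seconds = 0, 0.0
        for _ in range(repeats):
            start = time.perf_counter()
            total_tokens += generate(batch, context)
            total_seconds += time.perf_counter() - start
        # Guard against a zero denominator when the callable is near-instant.
        results[(batch, context)] = total_tokens / max(total_seconds, 1e-9)
    return results

def fake_generate(batch, context):
    """Stand-in for a real serving call; pretends each request yields 16 tokens."""
    return batch * 16

results = sweep(fake_generate, batch_sizes=[1, 8], context_lengths=[1024, 8192])
```

Run the same grid on each candidate platform and compare cell by cell; averaging over repeats rather than taking the best run keeps the numbers closer to sustained behavior.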

A note on what’s not in this article

Two things this piece deliberately doesn’t try to settle.

The first is the MLPerf head-to-head. Public TPU v8 vs. Blackwell numbers on a standardized benchmark suite are not yet available at the time of writing. When they appear, they will be more authoritative than any vendor-vs-prior-gen claim, including the ones cited above. Update your model when the data lands; don’t decide before it does.

The second is pricing. Google has not yet published public TPU v8 pricing for either SKU. The perf-per-dollar claims compare against the vendor’s own prior generation, not against Blackwell on equivalent workload. Anyone modeling a real TCO comparison today is doing it on incomplete inputs.

Those are not small caveats. They are the load-bearing pieces of an honest comparison, and they are not yet public.

Closing

The TPU v8 split is the loudest signal of a wider move that AWS, Groq, Cerebras, SambaNova, Tenstorrent and others have been making for two years: AI silicon is fragmenting along the workload axis, and the fragmentation looks structural rather than transient.

NVIDIA’s bet that workload heterogeneity is a software problem has won so far. CUDA is the most valuable software moat in the industry, and Blackwell’s NVLink fabric is good engineering by any measure. But the unit economics of inference at scale are increasingly the binding constraint for anyone serving large models, and the silicon design space for inference is meaningfully different from the design space for training. Google is the first hyperscaler willing to make that argument in product form rather than in research papers.

The practical move for builders is unromantic: profile the workload, pick the silicon that fits it, and budget the engineering cost of any migration with the same seriousness you’d give a database swap. “Which GPU?” was the right question in 2023. In 2026, the right question is longer — chip family, cloud, software stack, workload — and the people answering it carelessly will overpay or get stuck.
