Google doesn't talk about its TPU architecture as often as you'd think, given that these chips power everything from Search to Gemini. So when they publish a detailed explainer on what TPUs actually are, it's worth paying attention.
The post includes a video walkthrough that's surprisingly technical for a corporate blog. No hand-waving about "AI magic"—they get into the actual silicon design choices that make TPUs different from GPUs.
Let me break down the highlights and why they matter.
The Matrix Multiplication Problem
Here's the thing about training and running large language models: it's matrix multiplication all the way down. Whether you're doing forward passes, backprop, or inference, you're essentially multiplying enormous matrices together billions of times.
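To make that concrete, here's a toy single-attention-head forward pass in JAX. The shapes and the placeholder identity weights are made up purely for illustration, but the point stands: strip away the softmax and scaling, and it's matmuls end to end.

```python
import jax
import jax.numpy as jnp

def attention_head(x, wq, wk, wv):
    """Toy single-head attention: nearly every step is a matrix multiply."""
    q, k, v = x @ wq, x @ wk, x @ wv                          # three projection matmuls
    scores = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]))  # another matmul inside
    return scores @ v                                         # and one more

seq_len, d_model = 16, 64
x = jax.random.normal(jax.random.PRNGKey(0), (seq_len, d_model))
wq = wk = wv = jnp.eye(d_model)        # placeholder weights, just to run the shapes through
out = attention_head(x, wq, wk, wv)    # (16, 64)
```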
GPUs handle this reasonably well because they're designed for parallel computation. But they're general-purpose accelerators—they have to handle graphics rendering, physics simulations, and a thousand other workloads.
TPUs are purpose-built for exactly one thing: tensor operations. That specialization is the entire point.
Systolic Arrays: The Secret Sauce
The core architectural difference is something called a systolic array. Instead of shuffling data back and forth between compute units and memory (the way GPUs do), TPUs organize their arithmetic units in a grid where data flows through in waves.
Think of it like an assembly line. Each processing element does one multiply-accumulate operation, then passes the partial result to its neighbor. Data pulses through the array rhythmically—hence "systolic," like a heartbeat.
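Here's a deliberately simplified sketch of that dataflow in plain Python. It ignores the cycle-level timing skew and pipelining of real hardware, but it captures the idea: weights sit still in the grid, activations stream across the rows, and partial sums pick up one multiply-accumulate per processing element as they flow down each column.

```python
import numpy as np

def systolic_matvec(W, x):
    """Toy weight-stationary systolic array computing W.T @ x.

    PE (i, j) permanently holds weight W[i, j]. Activation x[i] streams
    across row i; partial sums flow down each column, accumulating one
    multiply-accumulate per PE, and are collected at the bottom edge.
    """
    n_rows, n_cols = W.shape
    psum = np.zeros(n_cols)            # partial sums entering the top edge
    for i in range(n_rows):            # partial sums advance one row per step
        for j in range(n_cols):        # on silicon, every PE in a row fires at once
            psum[j] += W[i, j] * x[i]  # the single multiply-accumulate each PE does
    return psum                        # bottom edge holds (W.T @ x)

W = np.arange(12, dtype=np.float32).reshape(4, 3)
x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
assert np.allclose(systolic_matvec(W, x), W.T @ x)
```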
This design minimizes memory access, which is critical. In modern AI workloads the bottleneck is usually not raw compute but memory bandwidth: moving data costs more energy and takes more time than the arithmetic itself.
By keeping data flowing through the array without constantly hitting DRAM, TPUs achieve much higher utilization. Google claims their systolic arrays can sustain over 90% of peak performance on typical transformer workloads.
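A back-of-the-envelope arithmetic-intensity calculation shows why this matters. The shapes, the bf16 storage, and the "each operand touches HBM exactly once" model below are simplifying assumptions, but the gap they reveal is real:

```python
# FLOPs vs. bytes moved for C = A @ B with A (M, K) and B (K, N), stored in bf16.
M = N = K = 4096
flops = 2 * M * N * K                    # one multiply + one add per (m, n, k) triple
hbm_bytes = 2 * (M * K + K * N + M * N)  # idealized: read A and B, write C, once each
print(flops / hbm_bytes)                 # ~1365 FLOPs per byte of memory traffic

# Elementwise ops (activations, residual adds) sit around one FLOP per byte or
# less, so they're bandwidth-bound no matter how many multipliers you have.
# Keeping operands flowing through the array instead of round-tripping to HBM
# is what keeps the multipliers busy.
```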
The Evolution: TPU v1 Through v5
Google's been iterating on this architecture since 2015. The progression is fascinating:
- TPU v1 (2015): Inference-only, INT8 operations, designed for Google's internal models
- TPU v2 (2017): Added training support with bfloat16 (BF16) arithmetic and high-bandwidth memory
- TPU v3 (2018): Roughly 2× the performance, liquid cooling, larger pods for multi-chip scaling
- TPU v4 (2020): Major leap in scale, with optical circuit switching between chip groups; used to train PaLM
- TPU v5e and v5p (2023): Current generation, optimized for different workload profiles
The v5 split is particularly interesting. The v5e is the cost-optimized version for inference and smaller training jobs. The v5p is the beast—designed for training frontier models at massive scale.
That product segmentation tells you something about how the market has evolved. In 2015, inference was the primary concern. Now, everyone's racing to train ever-larger models, so you need chips optimized for that specific use case.
The Pod Architecture
TPUs don't work in isolation. Google connects them into "pods" using custom high-bandwidth interconnects. A single TPU v5p pod contains 8,960 chips; at roughly 459 TFLOPS of BF16 per chip, that works out to around 4 exaFLOPS of BF16 compute per pod.
For context, at full utilization a single pod could work through the total compute used to train GPT-3 (roughly 3×10²³ FLOPs) in about a day. And you can rent it by the hour on Google Cloud.
The interconnect topology matters enormously at this scale. Google uses a 3D torus for TPU v4 and v5p, which provides multiple paths between any two chips. This redundancy helps with fault tolerance: when you're running a training job across thousands of chips for weeks, failures are inevitable.
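As a small illustration of why the wraparound links help, here's a hop-count calculation for a 3D torus. The 16×20×28 shape is chosen only because it multiplies out to 8,960 chips; treat it as illustrative rather than the documented pod layout.

```python
def torus_hops(a, b, dims=(16, 20, 28)):
    """Minimum hop count between chips a and b in a 3D torus of shape dims.

    Wraparound links mean each dimension costs at most dims[i] // 2 hops,
    and there are usually multiple shortest paths to route around failures.
    """
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))

print(torus_hops((0, 0, 0), (15, 10, 27)))  # 12: opposite corners are closer than they look
```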
The networking also enables techniques like 3D parallelism, where you split your model across data, pipeline, and tensor dimensions simultaneously. That's how you train 540B-parameter models like PaLM without running out of memory.
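In JAX, that kind of splitting is expressed with a device mesh and sharding annotations. The sketch below shows only the data and tensor axes (no pipeline axis) and runs on however many devices jax.devices() reports; the axis names and array shapes are illustrative, not a real training setup.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever devices are visible into a 2D mesh: a "data" axis for data
# parallelism and a "model" axis for tensor parallelism. (A real pod exposes
# thousands of devices; here we just take what the runtime gives us.)
devices = np.array(jax.devices()).reshape(len(jax.devices()), 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch along "data" and the weight matrix along "model".
batch = jax.device_put(jnp.ones((128, 1024)), NamedSharding(mesh, P("data", None)))
weights = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

# The compiler inserts whatever chip-to-chip communication the sharded matmul
# needs (all-gathers, reduce-scatters), so you never write it by hand.
y = jax.jit(jnp.dot)(batch, weights)
print(y.sharding)
```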
Software: JAX and the XLA Compiler
Hardware is only half the story. Google built the XLA (Accelerated Linear Algebra) compiler specifically to target TPUs efficiently. XLA takes your high-level model code and compiles it into optimized TPU instructions.
The tight integration between JAX and TPUs is no accident. JAX's functional programming model—pure functions, no hidden state—maps beautifully onto TPU execution. The compiler can analyze the entire computation graph and make global optimization decisions.
This is different from the PyTorch-on-GPU model, where you're typically running eager execution with manual optimization passes. JAX forces you to structure code in a way that's amenable to aggressive compilation.
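A minimal example of what that buys you: jit traces the whole function as one graph and hands it to XLA, which can then fuse ops and plan memory globally. The MLP block here is a made-up toy, not Google's code.

```python
import jax
import jax.numpy as jnp

def mlp_block(x, w1, w2):
    # Pure function: no hidden state, so the compiler sees the whole computation.
    return jax.nn.gelu(x @ w1) @ w2

x, w1, w2 = jnp.ones((8, 512)), jnp.ones((512, 2048)), jnp.ones((2048, 512))

mlp_jit = jax.jit(mlp_block)   # compile once, reuse for every call with matching shapes
out = mlp_jit(x, w1, w2)

# You can also peek at the program XLA actually receives:
print(jax.jit(mlp_block).lower(x, w1, w2).as_text()[:300])
```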
There's a learning curve, sure. But if you're training at scale on TPUs, that upfront investment pays off in utilization.
How TPUs Stack Up Against GPUs
The elephant in the room: should you use TPUs or NVIDIA GPUs?
Honest answer: it depends. GPUs have the ecosystem advantage—more frameworks, more tutorials, more Stack Overflow answers. If you're doing research and iterating quickly, that matters.
TPUs have the price-performance advantage for large-scale training, especially if you're already in the Google Cloud ecosystem. The v5e is particularly compelling for inference workloads.
But here's what I find most interesting: the architectural differences are converging. NVIDIA's Hopper architecture added features like the Transformer Engine that look suspiciously TPU-like. Google's TPUs have gotten more flexible and programmable.
We're in a phase where custom AI accelerators are proving their worth, but the designs are cross-pollinating. The next generation of chips—whether from Google, NVIDIA, AMD, or new entrants—will likely blend the best ideas from both approaches.
Why This Matters Beyond Google
Google's willingness to share TPU details isn't purely educational. They're making a play for the training-as-a-service market. If you're a startup that needs to train a large model, renting TPU pods is a viable alternative to buying H100s (assuming you can even get allocation).
The broader implication is about vertical integration in AI. Google controls the entire stack: silicon, compilers, frameworks, and models. That gives them optimization opportunities that nobody else has.
Amazon has Trainium and Inferentia. Microsoft is designing their own chips. Meta built MTIA. Everyone with deep pockets is realizing that custom silicon is a competitive advantage.
We're moving away from the era where you could just assume NVIDIA GPUs would power everything. The next wave of AI progress will happen on a more diverse set of hardware platforms, each optimized for different parts of the workload.
The Takeaway
TPUs represent Google's bet that specialized hardware beats general-purpose accelerators for AI workloads. Eight years in, that bet looks pretty good.
The architectural choices—systolic arrays, custom interconnects, tight compiler integration—reflect deep understanding of what actually bottlenecks large-scale training. It's not always raw FLOPS. It's memory bandwidth, chip-to-chip communication, and utilization under real workloads.
If you're building AI systems at scale, it's worth understanding these tradeoffs. The hardware you choose shapes what's possible, what's efficient, and ultimately what gets built.
And if you're just an AI enthusiast who geeks out about chip architecture (guilty), Google's explainer is a solid addition to the corpus of public knowledge about how these systems actually work. We need more of this.