The One-Layer Illusion
Reach for torch.compile and your model gets faster, right? Not always. The Hugging Face team just published part two of their PyTorch profiling series, and it reveals something most practitioners miss: a single nn.Linear layer with bias is already fused in eager mode. There's nothing left for the compiler to optimize.
This matters because it changes how you think about when compilation pays off. The post walks through profiler traces of nn.Linear, then stacks three of them into an MLP to show where fusion boundaries actually lie. If you've ever wondered why some models see massive compile speedups while others barely budge, this is your answer.
The scripts live in a public Hugging Face dataset, and the team ran everything on an NVIDIA A100-SXM4-80GB. You can reproduce the traces yourself using Hugging Face Spaces dev mode or the Jobs pipeline.
What nn.Linear Actually Does Under The Hood
nn.Linear is just a wrapper around matrix multiplication and bias addition. You write y = linear_layer(x), and PyTorch computes y = x @ w.T + b. The weight gets transposed, multiplied with the input, then the bias is added.
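As a quick check of that identity, the sketch below compares an nn.Linear forward pass against the hand-written x @ w.T + b. The sizes (batch of 4, 8 input features, 3 output features) are arbitrary choices for illustration, not values from the post, and it runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary illustrative sizes (assumption, not taken from the post).
linear_layer = nn.Linear(in_features=8, out_features=3, bias=True)
x = torch.randn(4, 8)

with torch.no_grad():
    y_module = linear_layer(x)
    # nn.Linear stores weight as (out_features, in_features), hence the transpose.
    y_manual = x @ linear_layer.weight.T + linear_layer.bias

print(y_module.shape)  # torch.Size([4, 3])
print(torch.allclose(y_module, y_manual, atol=1e-5))
```

The two results agree up to floating-point rounding, since the fused and unfused paths may accumulate in a slightly different order.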
Here's the twist: the transpose is metadata-only, and the bias addition never becomes a separate kernel. When you profile an nn.Linear forward pass, you see aten::t (transpose) in the CPU dispatch chain, but it doesn't launch a GPU kernel. It just rewrites the tensor's stride to represent the transposed layout. No data gets copied or reorganized.
The bias addition also disappears from the trace. Instead of seeing separate matmul and add kernels, you get a single aten::addmm call that dispatches to a cuBLAS GEMM kernel with a bias epilogue baked in.
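One way to look at the CPU dispatch chain yourself is to record a single forward pass with torch.profiler and list the operator names it saw. This is a minimal sketch, not the post's script: it profiles CPU activity only so it runs without a GPU, and the layer sizes are made up. Seeing the cuBLAS kernel on the GPU lane would additionally need a CUDA device and ProfilerActivity.CUDA.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Hypothetical sizes; a 2D input keeps the dispatch path simple.
layer = nn.Linear(8, 3)
x = torch.randn(4, 8)

# Record one forward pass, CPU-side operator dispatch only.
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
    layer(x)

# key_averages() groups events by operator name.
op_names = {evt.key for evt in prof.key_averages()}
for name in ("aten::linear", "aten::t", "aten::addmm", "aten::add"):
    print(f"{name:14s} recorded: {name in op_names}")
```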
Epilogues: The Secret to Avoiding Memory Traffic
An epilogue is a small computation that happens inside a GEMM kernel right before it writes results back to HBM (the GPU's main memory). Adding a bias, scaling by a constant, or applying an activation are classic epilogues. The point is to avoid a second round-trip to memory.
Memory traffic is expensive. If you did the matmul, wrote the result to HBM, then launched a separate kernel to add the bias, the result tensor would be written, read back, and written again. The epilogue folds that addition into the matmul's writeback phase. One kernel, one write of the output.
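To put rough numbers on that, here is a back-of-the-envelope sketch that counts HBM traffic for the output tensor only. It deliberately ignores reads of the input, weight, and bias, and the 4096 x 4096 fp16 output is a hypothetical size chosen for the arithmetic, not one from the post.

```python
def output_traffic_bytes(rows, cols, bytes_per_element, fused):
    """Bytes moved to or from HBM for the (rows x cols) output tensor only."""
    tensor_bytes = rows * cols * bytes_per_element
    if fused:
        # The epilogue adds the bias before writeback: the output is written once.
        return tensor_bytes
    # Unfused: the GEMM writes the result, then the add kernel reads it
    # back and writes the biased result again.
    return 3 * tensor_bytes


rows, cols, bytes_per_element = 4096, 4096, 2  # hypothetical fp16 output
unfused = output_traffic_bytes(rows, cols, bytes_per_element, fused=False)
fused = output_traffic_bytes(rows, cols, bytes_per_element, fused=True)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB")
# unfused: 96 MiB, fused: 32 MiB
```

Under these assumptions the separate add kernel triples the traffic on the output tensor, on top of paying for a second kernel launch.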
This is why you don't see aten::add in the profiler trace. nn.Linear calls torch.nn.functional.linear, which calls aten::linear, which notices the bias and dispatches aten::addmm(bias, x, weight.t()) instead of chaining separate ops. The cuBLAS kernel already knows how to handle this.
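The same fused call is exposed in Python as torch.addmm, so the equivalence can be checked directly. The sketch below passes the weight as its transposed view and uses random tensors of arbitrary size, so it illustrates the call shape rather than reproducing the post's setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Arbitrary illustrative shapes: weight is (out_features, in_features).
x = torch.randn(4, 8)
weight = torch.randn(3, 8)
bias = torch.randn(3)

y_linear = F.linear(x, weight, bias)
# addmm(input, mat1, mat2) computes input + mat1 @ mat2;
# the 1D bias broadcasts across the rows of the result.
y_addmm = torch.addmm(bias, x, weight.t())

print(torch.allclose(y_linear, y_addmm))
```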
torch.compile Can't Fuse What's Already Fused
So what happens when you wrap that single nn.Linear in torch.compile? Almost nothing.
The profiler traces show the same cuBLAS GEMM kernel on the GPU, the same aten::addmm on the CPU, and a few extra compile-specific rows (the Torch-Compiled Region overhead). The kernel is identical because eager mode already picked the fused variant. Compile has no work to do.
This is the point to internalize: compile needs multiple operations to fuse. A single GEMM-with-bias is already fused, so there is nothing to merge. The wins come when you stack operations and let the compiler remove kernel launch boundaries between them.
The team proves this by building an MLP with three nn.Linear layers and activation functions in between. That's where fusion opportunities appear.
The Transpose Vanishes in Compiled Mode
If you compare the eager and compiled traces carefully, there's one subtle difference in the CPU dispatch chain. Eager mode shows aten::linear → aten::t → aten::addmm. Compiled mode skips straight to aten::addmm with no transpose op.
Why? Because the compiled graph pre-computes the transposed layout during compilation and bakes it into the kernel call. Eager mode has to dispatch the transpose dynamically every forward pass, even though it's just metadata surgery. Compile does it once at trace time.
This isn't a speedup you'd measure in isolation (the transpose is CPU-only and nearly free), but it's evidence that compile operates on a different abstraction level. It sees the whole graph, prunes metadata ops, and hands the backend a cleaner instruction stream.
Strides and Views: How Transpose Works Without Copying
A quick detour into PyTorch internals: tensors store data as a flat contiguous array in memory. The shape and stride are metadata that tell PyTorch how to walk that array. For a 2D tensor, a stride of (2, 1) means "step 2 elements to move to the next row, step 1 to move to the next column."
Transpose swaps the two entries of the shape and of the stride without touching the data. You get a different view of the same underlying memory. No kernel, no copy, just a metadata update on the CPU.
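A small CPU-only sketch makes the view behavior concrete. The 3 x 2 tensor is an arbitrary example; the point is that the transposed tensor reports swapped shape and stride while pointing at the same memory.

```python
import torch

x = torch.arange(6).reshape(3, 2)  # rows: [0, 1], [2, 3], [4, 5]
xt = x.t()

print(x.shape, x.stride())    # torch.Size([3, 2]) (2, 1)
print(xt.shape, xt.stride())  # torch.Size([2, 3]) (1, 2)

# Same starting address: no data was copied.
print(x.data_ptr() == xt.data_ptr())  # True
# Only the walk order changed, so the view is no longer contiguous.
print(xt.is_contiguous())  # False

# Writing through the view is visible in the original tensor.
xt[0, 2] = 99
print(x[2, 0].item())  # 99
```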
This is why aten::t doesn't show up on the GPU lane in the profiler trace. It's a view operation, not a compute operation.
When Does Compile Actually Help?
The MLP example in the blog post is where things get interesting. Three nn.Linear layers with activations in between create opportunities for vertical fusion: collapsing sequential operations into a single kernel that keeps intermediate results in registers or shared memory instead of writing them to HBM.
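The post doesn't show the full MLP profiling results in the excerpt, but the setup is clear. Eager mode launches separate kernels for each linear layer and each activation. Compile can, in principle, fold an activation into the neighboring GEMM's epilogue or fuse it with adjacent elementwise work, so the intermediate never makes a round trip through HBM. Whether back-to-back GEMMs themselves can be merged depends on the backend and the shapes, and the excerpt doesn't show it.
This is where the larger compile speedups come from: not from a single layer, but from chains of operations that reuse the same data. The exact factor depends on the model, the shapes, and the hardware.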
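For a feel of the eager-mode baseline, the sketch below builds a three-layer MLP and counts how many times each operator is dispatched in one forward pass. The layer widths and the choice of ReLU are assumptions made for illustration; the post's MLP may use different sizes and activations. It profiles CPU dispatch only, so it runs without a GPU.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Hypothetical layer sizes and activation (not taken from the post).
mlp = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
x = torch.randn(32, 64)

# Record a single eager forward pass.
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
    mlp(x)

# Map each operator name to the number of times it was called.
counts = {evt.key: evt.count for evt in prof.key_averages()}
print("aten::addmm calls:", counts.get("aten::addmm", 0))
print("aten::relu calls: ", counts.get("aten::relu", 0))
```

Each linear layer and each activation shows up as its own dispatch. To compare against compiled mode, one option is to wrap the model with torch.compile, run it a couple of times to get past compilation, and profile again on a GPU with ProfilerActivity.CUDA enabled.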
Why This Matters For Model Development
If you're building models and reaching for torch.compile as a reflex, you need to understand these boundaries. Compilation is not a magic "make it faster" button. It's a tool for removing kernel launch overhead and memory traffic between operations that can share data.
A single nn.Linear? Already fused. A stack of them with activations? Compile can help. A whole transformer block with attention, FFN, and layer norms? Many more fusion opportunities.
The profiler traces teach you to see where those opportunities are. Load up torch.profiler, look for chains of small kernels with launch overhead in between, and ask whether those boundaries are necessary. That's where compile earns its keep.
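As a starting point for that kind of inspection, here is a small hypothetical helper (op_call_counts is not a PyTorch API, just a name used here) that runs a function once under the profiler and reports how often each operator was dispatched. The example function is a made-up chain of three elementwise ops, the sort of pattern that is a natural fusion candidate.

```python
import torch
from torch.profiler import profile, ProfilerActivity


def op_call_counts(fn, *args):
    """Run fn(*args) once under the profiler and return {op name: call count}."""
    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
        fn(*args)
    return {evt.key: evt.count for evt in prof.key_averages()}


def scale_shift_relu(x):
    # Three elementwise ops; in eager mode each one is dispatched separately.
    return torch.relu(x * 2.0 + 1.0)


counts = op_call_counts(scale_shift_relu, torch.randn(1024, 1024))
for name in ("aten::mul", "aten::add", "aten::relu"):
    print(name, counts.get(name, 0))
```

On a GPU, each of those dispatches would typically become its own small kernel with a full read and write of the tensor, which is the boundary a compiler can try to remove.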
The Bigger Picture: Profiling As A Skill
This series from Hugging Face is doing something rare: teaching people to read profiler traces with the same care you'd apply to reading code. Most ML engineers treat profiling as a black box that spits out percentages. This series shows you the dispatch chain, the kernel boundaries, the memory access patterns.
If you're serious about making models fast, not just throwing more GPUs at the problem, you need this skill. The profiler tells you what's actually happening, not what you think is happening.
And sometimes, like with nn.Linear, what's happening is that PyTorch already did the optimization for you.