An update to Gemini Diffusion is one of my most eagerly anticipated AI releases. It was released to mild fanfare (mostly because you needed to request access to use it), and there has been silence ever since.
Hopefully it's not more Google abandonware, because it was wicked fast and a delight to use.
It's not a very promising direction because autoregressive LLMs still deliver better output quality per model weight, as a rule.
Now, is it possible for a model to combine the advantages of both? Combine the fast generation and multidirectional causality of diffusion with the precision, capabilities, and generalization of autoregression?
Maybe. This paper is research in that direction. So far, it's not a clear upgrade over autoregressive LLMs.
Diffusion LMs do seem to be able to get more out of the same data. In a world where we are already training transformer-based LLMs on all the text available, diffusion LMs' ability to keep learning from a fixed set of data may eventually let them outperform autoregressive transformers.
https://arxiv.org/abs/2511.03276
There's another paper that shows you can get the same effect by training an autoregressive model on fill-in-the-middle (FIM) data.
So it's more about the masked-modeling objective than diffusion itself.
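For anyone who hasn't seen the FIM trick: you cut a document into prefix/middle/suffix and reorder it so an ordinary left-to-right model ends up predicting the middle with the suffix already in context. A minimal Python sketch of the data construction; the sentinel strings below are made-up placeholders, not any particular tokenizer's special tokens:

    import random

    # Illustrative sentinel strings; real tokenizers use dedicated special tokens.
    PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

    def to_fim_example(text, rng):
        # Pick two cut points and split the document into prefix/middle/suffix.
        i, j = sorted(rng.sample(range(len(text)), 2))
        prefix, middle, suffix = text[:i], text[i:j], text[j:]
        # Prefix-Suffix-Middle ordering: the middle is still generated left to
        # right, but it is now conditioned on text that originally came after it.
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

    rng = random.Random(0)
    print(to_fim_example("The quick brown fox jumps over the lazy dog.", rng))

Trained on enough examples like this, an autoregressive model gets the same condition-on-both-sides signal that a masked/diffusion objective provides, which is presumably why the effect carries over.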
4-5 times faster with minimal change in quality seems like a clear upgrade in efficiency.
Latency may be better, but throughput (the thing companies care about) may be the same or worse, since at every step the entire diffusion window has to be passed through the model. With AR models only the most recent token goes through, which is much more compute-efficient and lets you stay memory-bound. The trade-off with these models is that they commit more than one token per forward pass, but I don't know at what point that becomes worth it (probably depends on the model and the diffusion window size).
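For a feel of where that break-even point could sit, here's a crude roofline-style cost model. Every number in it is an assumption (an 8B model in bf16 on roughly H100-class memory bandwidth and compute), not a measurement from the paper:

    # Crude batch-1 cost model; every number below is an assumption.
    def ar_time_per_token(param_bytes, mem_bw):
        # Batch-1 AR decode is memory-bound: each new token streams the weights once.
        return param_bytes / mem_bw

    def diff_time_per_token(params, window, committed, flops_peak, param_bytes, mem_bw):
        # One diffusion pass runs the whole window through the model (compute cost)
        # while still streaming the weights once (memory cost); take the max of the
        # two and amortize it over however many tokens the pass commits.
        compute = 2 * params * window / flops_peak   # ~2 FLOPs per parameter per token
        memory = param_bytes / mem_bw
        return max(compute, memory) / committed

    params, param_bytes = 8e9, 8e9 * 2     # assumed 8B model in bf16
    mem_bw, flops_peak = 3.35e12, 1e15     # assumed ~3.35 TB/s HBM, ~1 PFLOP/s bf16

    ar = ar_time_per_token(param_bytes, mem_bw)
    for k in (1, 2, 4, 8):
        d = diff_time_per_token(params, 64, k, flops_peak, param_bytes, mem_bw)
        print(f"{k} tokens committed/pass -> diffusion/AR time ratio {d/ar:.2f}")

Under these particular assumptions a 64-token window still fits under the weight-streaming time, so any pass that commits more than one token already comes out ahead; with much larger windows, bigger batches, or weaker hardware the compute term takes over and the break-even point moves.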
> still deliver better output quality per model weight, as a rule.
Is it possible to quantify that and just have a linked slider for quality and speed? If I can get an answer that's 80% right in 1/10th the time and then iterate on it, who comes out ahead?
Yes, but you can also do the same thing with autoregressive models just by making them smaller. This tradeoff always exists; the question is whether the Pareto curve for diffusion models ever crosses or dominates the best autoregressive option at the same throughput (or quality).
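To make "crosses or dominates" concrete: a diffusion model only earns a spot on the combined Pareto frontier if no autoregressive model matches or beats it on both quality and throughput at once. Toy check below; the operating points are invented placeholders, not benchmark numbers:

    # Toy Pareto check over (quality, tokens/sec) points; higher is better on both axes.
    def pareto_front(points):
        return [p for p in points
                if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)]

    ar_points   = [(0.80, 40), (0.72, 120), (0.60, 300)]   # hypothetical AR model sizes
    diff_points = [(0.70, 200), (0.62, 500)]               # hypothetical dLLMs

    front = pareto_front(ar_points + diff_points)
    print("combined frontier:", front)
    print("diffusion points that survive:", [p for p in front if p in diff_points])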
That's bizarre, because I would expect the opposite. For reasoning you go step by step, and when you're done, you quickly diffuse the answer.
Unification in logic programming isn't a forwards-only process, so there's no reason to expect deduction in an AI to proceed in a procedural, step-by-step fashion either. What ultimately matters is that all of the various deductions unify coherently in the end.
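For anyone who hasn't used a logic language: unification solves equations between terms in whatever order the constraints arrive, and a binding discovered late can pin down a variable introduced earlier. A tiny self-contained sketch (my own toy example, not from the paper):

    # Minimal first-order unification over tuples and variable strings ("?x").
    def walk(t, subst):
        # Follow chains of bindings until we hit a value or an unbound variable.
        while isinstance(t, str) and t.startswith("?") and t in subst:
            t = subst[t]
        return t

    def unify(a, b, subst=None):
        subst = dict(subst or {})
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            return subst
        if isinstance(a, str) and a.startswith("?"):
            subst[a] = b; return subst
        if isinstance(b, str) and b.startswith("?"):
            subst[b] = a; return subst
        if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
            for x, y in zip(a, b):
                subst = unify(x, y, subst)
                if subst is None:
                    return None
            return subst
        return None

    # "?y" is only pinned down by the second constraint, which retroactively
    # determines what the first term stood for.
    s = unify(("parent", "?x", "?y"), ("parent", "alice", "?y"))
    s = unify("?y", "bob", s)
    print(s)   # {'?x': 'alice', '?y': 'bob'}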
Diffusion is favored by current GPUs.
Over time we seem to have a tendency to build models that are well matched to our machines.
Are TPUs different?
Not really. The problem is that transformer LLMs are autoregressive, are O(n^2) for self-attention, and also require insane amounts of bandwidth to "page in" the weights into the relevant compute units. TPUs do this faster than a CPU, like any accelerator, but fundamentally this is a challenge. There are attempts to build hardware where the weights are burned into the silicon, but that carries other meaningful downsides.
But OP is referring to the fact that diffusion is friendlier on bandwidth and doesn't need large n^2 compute blocks in the critical path.
In this paper both the diffusion and the autoregressive models are transformers with O(n^2) cost for long sequences. They share the "Exact KV Cache" for committed tokens.
Diffusion just lets you spend more compute in the same pass so you don't redundantly access the same memory; it can only improve speed beyond the memory-bandwidth limit by committing multiple tokens each pass (a rough sketch of that loop follows below).
Other linear models like Mamba get away from the O(n^2) cost, but the type of neural architecture is orthogonal to the method of generation.
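Schematically, that kind of block-wise decode with a cache over committed tokens looks something like the sketch below. The model callable here is a stand-in placeholder, not the paper's actual API; the point is just the loop structure:

    # Schematic block-wise diffusion decode with a cache over committed tokens.
    def block_diffusion_decode(model, prompt, block_size, num_blocks,
                               refine_steps, mask_id=-1):
        cache = list(prompt)            # stands in for cached K/V of committed tokens
        out = list(prompt)
        for _ in range(num_blocks):
            block = [mask_id] * block_size          # start from a fully masked block
            for _ in range(refine_steps):
                # Each refinement step only runs the current block against the
                # cache; already-committed tokens are not passed through again.
                block = model(cache, block)
            out.extend(block)
            cache.extend(block)         # commit the block: its K/V joins the cache
        return out

    # Toy stand-in "model": replaces masks with a dummy token id.
    toy = lambda cache, block: [t if t != -1 else 0 for t in block]
    print(block_diffusion_decode(toy, [1, 2, 3], block_size=4, num_blocks=2, refine_steps=3))

Each pass touches the weights once but can refine and then commit a whole block, which is where the amortization over the memory-bandwidth cost comes from.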
I've tried dLLMs like Mercury and they look promising.