¹Simon Fraser University, ²Independent
Main result (AFHQ‑v1 512×512). Unconditional samples from StyleGAN2‑ADA without RTM (left) vs. with RTM (right). RTM improves both quality (FID 4.79 vs. 4.99) and diversity (Recall 0.565 vs. 0.507).
Despite remarkable progress, image generation is far from solved. The dominant metric, FID, conflates sample fidelity with mode coverage and is close to saturation. As a result, a model can exhibit mode collapse while still achieving a low FID, since a handful of sharp, near‑duplicate images can outscore a model that faithfully covers the full data distribution. We argue that precision and recall are essential complements to FID and that, with FID this close to saturation, the more meaningful goal is to improve diversity and coverage. Achieving high recall requires a model that explicitly prioritizes mode coverage, unlike most generative models, which optimize sample fidelity. We introduce the Recursive Token Mapper (RTM), which replaces the single‑pass latent mapping in style‑based generators with an iterative refinement process, and show that this consistently improves both quality and diversity. Integrated with Implicit Maximum Likelihood Estimation (IMLE), which optimizes mode coverage by design, RTM achieves the highest precision and recall among current state‑of‑the‑art approaches while maintaining competitive FID, with improvements on CIFAR‑10, CelebA‑HQ at 256×256, and nine few‑shot benchmarks. RTM also improves StyleGAN2 and StyleGAN2‑ADA on CIFAR‑10 and AFHQ‑v1 at 512×512, demonstrating that the benefit is not specific to IMLE. Unlike flow‑matching baselines, which achieve competitive FID at the expense of coverage, recursive refinement improves quality and diversity together.
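To make the precision and recall argument concrete, here is a minimal sketch of one common way to estimate the two quantities, the k‑nearest‑neighbour estimator. It is an assumption that this matches the protocol behind the numbers reported on this page. The function names (`knn_radii`, `coverage`, `precision_recall`) and the toy 2‑D points are illustrative only; in practice the distances are computed on features from a pretrained network, not on raw coordinates.

```python
import math

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, refs, ref_radii):
    """Fraction of queries that fall inside at least one reference k-NN ball."""
    hits = 0
    for q in queries:
        if any(math.dist(q, r) <= rad for r, rad in zip(refs, ref_radii)):
            hits += 1
    return hits / len(queries)

def precision_recall(real, fake, k=3):
    precision = coverage(fake, real, knn_radii(real, k))  # are fakes realistic?
    recall = coverage(real, fake, knn_radii(fake, k))     # are reals covered?
    return precision, recall

# Toy data (hypothetical): two real modes, but every generated sample
# sits near the first mode only.
real = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
fake = [(0.0, 0.1), (0.0, 0.9), (0.0, 0.5), (0.2, 0.5)]
precision, recall = precision_recall(real, fake, k=1)
print(precision, recall)  # 1.0 0.5
```

In this toy case every generated point is plausible, so precision is perfect, yet half of the real points have no generated neighbour, so recall drops to 0.5. This is the collapsed‑but‑sharp situation that a fidelity‑dominated score can fail to penalize.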
The setup. Style‑based generators such as StyleGAN and RS‑IMLE build images in two stages. A small mapping network turns Gaussian noise z into a style vector w, and a convolutional decoder turns w into an image by modulating each of its layers with w through Adaptive Instance Normalization (AdaIN). The decoder is very sensitive to small changes in w, so where the mapper places w directly determines sample quality.
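The modulation step can be sketched as follows. This is a minimal illustration of Adaptive Instance Normalization on a single feature map, assuming the per‑channel scale and bias have already been produced from w by a learned affine layer (not shown). The function name `adain` and the toy numbers are hypothetical.

```python
import math

def adain(features, scales, biases, eps=1e-5):
    """Normalize each channel to zero mean and unit variance, then apply the
    style-derived scale and bias.

    features: list of channels, each a flat list of activations.
    scales, biases: one value per channel (assumed to come from w).
    """
    out = []
    for channel, scale, bias in zip(features, scales, biases):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([scale * (x - mean) / std + bias for x in channel])
    return out

# Two channels with four activations each (toy values).
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 12.0, 12.0]]
styled = adain(features, scales=[2.0, 0.5], biases=[1.0, -1.0])
```

After the call, each channel's mean equals its bias and its standard deviation is close to its scale, regardless of the input statistics. Because the style fully sets these statistics, a small shift in w changes every modulated layer at once.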
The bottleneck. In every prior IMLE and StyleGAN model, the mapping network is a plain MLP processed in a single forward pass. The mapper has to settle every aspect of w at once: identity, structure, texture, and fine details. This single‑pass design is a quiet but real ceiling on what the model can produce, especially for diversity.
Our change: refine instead of one shot. RTM replaces the single‑pass mapper with a recursive block adapted from Tiny Recursive Models. The noise is projected into a small set of latent tokens that are then refined through repeated cycles of token‑mixing and channel‑mixing MLPs, sharing weights across cycles. After the last cycle the tokens are projected back into a style vector w.
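The data flow described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the released implementation: the residual connections, the tanh nonlinearity, the square hidden widths, the absence of normalization, the random untrained weights, and all dimensions and names (`RecursiveTokenMapper`, `cycle`, `readout`) are choices made here for illustration. Only the overall structure (project z to tokens, apply a shared token‑mixing and channel‑mixing block for several cycles, read out w) comes from the description.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def mlp(x, w1, w2):
    """Two-layer MLP applied to every row of x (tanh is an assumption)."""
    hidden = [[math.tanh(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

class RecursiveTokenMapper:
    def __init__(self, z_dim, n_tokens, channels, w_dim, seed=0):
        rng = random.Random(seed)
        self.n_tokens, self.channels = n_tokens, channels
        self.proj_in = rand_matrix(z_dim, n_tokens * channels, rng)
        # A single set of block weights, reused by every cycle.
        self.tok_w1 = rand_matrix(n_tokens, n_tokens, rng)
        self.tok_w2 = rand_matrix(n_tokens, n_tokens, rng)
        self.ch_w1 = rand_matrix(channels, channels, rng)
        self.ch_w2 = rand_matrix(channels, channels, rng)
        self.proj_out = rand_matrix(n_tokens * channels, w_dim, rng)

    def tokens_from_noise(self, z):
        flat = matmul([z], self.proj_in)[0]
        c = self.channels
        return [flat[i * c:(i + 1) * c] for i in range(self.n_tokens)]

    def cycle(self, tokens):
        # Token mixing: the MLP acts across the token axis, so transpose first.
        mixed = transpose(mlp(transpose(tokens), self.tok_w1, self.tok_w2))
        tokens = add(tokens, mixed)
        # Channel mixing: the MLP acts on each token's channels.
        return add(tokens, mlp(tokens, self.ch_w1, self.ch_w2))

    def readout(self, tokens):
        flat = [v for row in tokens for v in row]
        return matmul([flat], self.proj_out)[0]

    def num_parameters(self):
        mats = [self.proj_in, self.tok_w1, self.tok_w2,
                self.ch_w1, self.ch_w2, self.proj_out]
        return sum(len(m) * len(m[0]) for m in mats)

    def __call__(self, z, cycles):
        tokens = self.tokens_from_noise(z)
        for _ in range(cycles):
            tokens = self.cycle(tokens)
        return self.readout(tokens)

mapper = RecursiveTokenMapper(z_dim=8, n_tokens=4, channels=6, w_dim=5)
z = [0.1 * i for i in range(8)]
w = mapper(z, cycles=3)
```

The number of cycles is an argument to the call, not part of the model, so `num_parameters()` stays the same however many refinement steps are run. Stopping after fewer cycles and calling `readout` gives the kind of intermediate style vector that can be decoded early.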
Why it works. Because the recursive block is shared across cycles, RTM costs almost nothing extra at training time but lets the model spend more compute on the part it really needs to get right: positioning w in style space. When wrapped around RS‑IMLE, RTM inherits IMLE's mode‑coverage‑by‑construction guarantee, since every training image is already paired with a noise vector that decodes near it. RTM improves what IMLE already does well, without giving anything up.
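The coverage argument rests on how IMLE pairs training data with latents. The sketch below shows the matching step of vanilla IMLE, assuming Euclidean distance and a toy identity generator; RS‑IMLE's additional sample rejection is not shown. The name `imle_assign` is hypothetical.

```python
import math
import random

def imle_assign(data, generator, latent_dim, n_candidates, rng):
    """For every data point, keep the candidate latent whose generated
    output lands nearest to it. Returns (data point, latent, distance)."""
    latents = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
               for _ in range(n_candidates)]
    outputs = [generator(z) for z in latents]
    pairs = []
    for x in data:
        best = min(range(n_candidates), key=lambda i: math.dist(outputs[i], x))
        pairs.append((x, latents[best], math.dist(outputs[best], x)))
    return pairs

def generator(z):
    # Toy stand-in for mapper + decoder: the "image" is the latent itself.
    return list(z)

data = [[0.0, 0.0], [2.0, 2.0]]
few = imle_assign(data, generator, latent_dim=2, n_candidates=4, rng=random.Random(0))
many = imle_assign(data, generator, latent_dim=2, n_candidates=32, rng=random.Random(0))

# The training loss averages the matched distances over *all* data points,
# so no data point can be ignored by the generator.
loss_few = sum(d for _, _, d in few) / len(few)
loss_many = sum(d for _, _, d in many) / len(many)
```

Because the loop runs over data points and not over generated samples, every training example receives a matched latent and contributes to the loss. That is the sense in which coverage is built into the objective, and the mapper's job is then to place w well for each matched latent.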
Architecture. Left: the baseline single‑pass MLP mapping network used in StyleGAN and RS‑IMLE. Right: our Recursive Token Mapper, which projects z into tokens, refines them through H shared recursive cycles, and reads out a style vector w. Only the mapper changes; the decoder conditions on w via AdaIN exactly as in the baseline.
Every pair of videos compares the baseline mapper and RTM on the same random latent, so any visible difference is the mapper alone. Each pair shows two complementary things, side by side:
RTM (left). Each frame is the image you would get if you stopped the recursion early and decoded right then. The video walks through successive recursive cycles, so the progression visualizes how the mapper places w as it refines. The last frame is the image RTM actually outputs at inference.
Baseline mapper (right). Each frame shows what the same baseline produces if you keep only the first k MLP layers of its mapping network, for k growing from one layer up to the full network. The video is a layer‑by‑layer build‑up that exposes how much each individual layer changes the output. The last frame is the baseline's full output.
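Both videos follow the same bookkeeping: apply a prefix of the mapper, read out a style vector, decode, and repeat with a longer prefix. A minimal sketch of that loop is below. The helper `prefix_outputs` is hypothetical, and the scalar steps are arbitrary stand‑ins for real networks that illustrate only the procedure, not the behavior of either mapper. For the baseline, it assumes every hidden layer has the same width as w, so that a truncated output can be passed to the decoder.

```python
def prefix_outputs(x, steps, readout=lambda state: state):
    """Return readout(state) after the first 1, 2, ..., len(steps) steps."""
    frames = []
    state = x
    for step in steps:
        state = step(state)
        frames.append(readout(state))
    return frames

# RTM-style: the same shared cycle applied H times (toy scalar update).
def shared_cycle(s):
    return s + 0.5 * (1.0 - s)

rtm_frames = prefix_outputs(0.0, [shared_cycle] * 4)

# Baseline-style: k distinct layers, each applied once (toy scalar layers).
def layer_a(s):
    return s + 3.0

def layer_b(s):
    return -2.0 * s

def layer_c(s):
    return s + 7.0

baseline_frames = prefix_outputs(0.0, [layer_a, layer_b, layer_c])
```

The last entry of each list is the full mapper's output; earlier entries correspond to the earlier video frames. In a real setup, `readout` would project the tokens to w and pass the result through the unchanged decoder.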
The RTM video sharpens and stabilizes class identities as the cycles progress. With the baseline, nothing recognizable appears until almost the very last layers; only once the full mapping network is in place do the images snap into recognizable objects.
[Videos: Example 1 and Example 2, each showing RTM (ours) alongside the baseline mapper.]
High‑resolution faces. RTM lands on a plausible face within the first few cycles and then refines skin, hair, and background texture. With the baseline, identity, lighting, and pose keep visibly changing as more layers are added; the face only looks coherent at the very end.
[Videos: Example 1 and Example 2, each showing RTM (ours) alongside the baseline mapper.]
The same RTM idea, dropped into a StyleGAN2‑ADA generator instead of an IMLE one. The RTM samples cover a wider palette of coat colors, patterns, and breeds, while the baseline keeps drifting back to similar tabby‑striped grey or brown cats. This is exactly the recall failure RTM is designed to fix.
[Videos: Example 1 and Example 2, each showing RTM (ours) alongside the baseline mapper.]
RTM settles early and refines later. The overall identity of the image appears in the first few cycles, and the remaining cycles polish texture, sharpen edges, and stabilize color. There are no abrupt jumps, just smooth refinement.
The single‑pass mapper is fragile. Adding one MLP layer at a time to the baseline can change identity, lighting, and palette, sometimes producing unnatural intermediate frames. Each layer carries information that the rest of the network cannot recover when that layer is missing, which is the kind of brittleness RTM avoids.
RTM degrades gracefully. Cutting refinement cycles from RTM produces a slightly less polished version of the same image, not a different image. Every cycle helps, but no single cycle is critical, which is what makes the recursive design more stable in training and more diverse at inference.
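One simple way to put a number on "a slightly less polished version of the same image" is to measure how far each intermediate style vector sits from the final one. The diagnostic below is a hypothetical illustration, not a measurement reported here; `distance_to_final` and the toy 2‑D vectors are assumptions.

```python
import math

def distance_to_final(frames):
    """Euclidean distance from every intermediate vector to the last one."""
    final = frames[-1]
    return [math.dist(f, final) for f in frames]

# Toy style vectors after cycles 1, 2, and 3 (made-up values).
frames = [[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]]
print(distance_to_final(frames))  # [5.0, 4.0, 0.0]
```

Under graceful degradation the distances would shrink steadily toward zero as cycles are added, while a brittle mapper would show large jumps between consecutive entries.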
@misc{esmaeilzadeh2026onepass,
title = {One Pass Is Not Enough: Recursive Latent Refinement for Generative Models},
author = {Mehdi Esmaeilzadeh and Alexia Jolicoeur-Martineau and Chirag Vashist and Ke Li},
year = {2026},
eprint = {2605.15309},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.15309},
doi = {10.48550/arXiv.2605.15309}
}