Rectifying Latent Space for Generative Single-Image Reflection Removal
Correspondence to Xiaojie Guo (xj.max.guo@gmail.com)
Abstract
Single-image reflection removal is highly ill-posed: a captured photograph is a linear superposition of the background transmission and a distracting reflection layer. Existing methods struggle to reason about this composition, so they fail on challenging in-the-wild imagery. We repurpose a powerful latent diffusion model to properly perceive and process such composite inputs.
GenSIRR introduces a reflection-equivariant VAE that aligns the latent space with the physics of layer formation, a learnable task-specific text embedding that bypasses ambiguous language guidance, and a depth-guided early-branching sampling strategy that selects the most faithful generation path using structural cues. Together, these components enable accurate, high-quality reflection removal, achieving new state-of-the-art performance across benchmarks and real-world photographs.
Method Overview
GenSIRR builds on FLUX.1 Kontext and is trained in two stages. Stage I fine-tunes a LoRA adapter over the VAE with the reflection-equivariance loss to stabilize latent geometry (a sketch of the objective follows the list below). Stage II freezes the VAE and optimizes both the DiT backbone and the learnable task embedding on a curated mix of real and synthetic reflections. At inference time we optionally apply Depth-Guided Early-Branching Sampling (DEBS) to select among candidate generations.
- Training Data: PD-12M for re-VAE, curated Real/Nature/RRW mixtures for DiT.
- Optimization: AdamW with learning rates 1e-4 (re-VAE) and 1e-5 (DiT), and batch sizes 128 and 32, respectively.
- Sampling: 28 denoising steps with optional DEBS branching at step 1.
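As a rough illustration of what an equivariance objective over the linear formation model I = T + R could look like, here is a minimal PyTorch sketch; the callable `encode`, the blending weights `alpha`/`beta`, and the MSE penalty are placeholders for illustration, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def reflection_equivariance_loss(encode,
                                 transmission: torch.Tensor,
                                 reflection: torch.Tensor,
                                 alpha: float = 1.0,
                                 beta: float = 1.0) -> torch.Tensor:
    """Hypothetical re-VAE objective: encoding should commute with the linear
    image-formation model I = alpha*T + beta*R, so the latent of the mixture
    matches the same mixture of the individual latents."""
    mixture = (alpha * transmission + beta * reflection).clamp(0.0, 1.0)
    z_mix = encode(mixture)                                   # latent of the composite image
    z_sum = alpha * encode(transmission) + beta * encode(reflection)
    return F.mse_loss(z_mix, z_sum)
```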
Depth-Guided Early-Branching
Reflections manifest as geometry that is inconsistent with the underlying scene. DEBS evaluates partially denoised candidates with a monocular depth estimator and continues only the trajectory that exposes the deepest background (see the sketch after this list).
- Insight: If reflections vanish in the first denoising step, they stay suppressed throughout sampling.
- Metric: Global average depth serves as a reference-free quality proxy.
- Outcome: Deterministic results with significantly fewer retries and only milliseconds of additional compute.
One-step latents already reveal which trajectory will succeed.
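In pseudocode, the selection step might look like the following PyTorch-style sketch; `denoise_step`, `decode`, and `depth_model` are hypothetical stand-ins for the model's first denoising step, the VAE decoder, and the monocular depth estimator.

```python
import torch

@torch.no_grad()
def debs_select(candidate_latents, denoise_step, decode, depth_model):
    """Depth-guided early-branching (sketch): take a single denoising step per
    candidate, decode a rough preview, score it by global average depth, and
    keep only the deepest branch for the remaining denoising steps."""
    best_latent, best_depth = None, float("-inf")
    for z in candidate_latents:              # k independent noise seeds
        z1 = denoise_step(z)                 # first of the denoising steps
        preview = decode(z1)                 # coarse image estimate
        mean_depth = depth_model(preview).mean().item()
        if mean_depth > best_depth:          # deeper background = fewer reflections
            best_depth, best_latent = mean_depth, z1
    return best_latent                       # continue full sampling from this branch
```

Under this reading, only the selected branch pays for the full 28-step schedule; the other candidates incur just one denoising step and one depth pass each.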
Benchmark Results
| Method | Real20 (20) PSNR | Real20 (20) SSIM | SIR2 (454) PSNR | SIR2 (454) SSIM | Nature (20) PSNR | Nature (20) SSIM | Average PSNR | Average SSIM |
|---|---|---|---|---|---|---|---|---|
| ERRNet (CVPR'19) | 22.89 | 0.803 | 23.55 | 0.882 | 22.18 | 0.756 | 23.47 | 0.874 |
| IBCLN (CVPR'20) | 21.86 | 0.762 | 24.20 | 0.884 | 23.57 | 0.783 | 24.08 | 0.875 |
| YTMT (NeurIPS'21) | 23.26 | 0.806 | 24.08 | 0.890 | 23.85 | 0.810 | 24.04 | 0.883 |
| Dong et al. (ICCV'21) | 23.34 | 0.812 | 24.25 | 0.901 | 23.45 | 0.808 | 24.18 | 0.894 |
| DSRNet (ICCV'23) | 23.91 | 0.818 | 25.71 | 0.906 | 25.22 | 0.832 | 25.62 | 0.899 |
| Zhu et al. (CVPR'24) | 21.83 | 0.801 | 25.48 | 0.897 | 26.04 | 0.846 | 25.37 | 0.909 |
| DSIT (NeurIPS'24) | 25.22 | 0.836 | 26.43 | 0.911 | 26.77 | 0.847 | 26.40 | 0.905 |
| RDNet (CVPR'25) | 25.71 | 0.850 | 26.69 | 0.908 | 26.31 | 0.846 | 26.63 | 0.903 |
| DAI (AAAI'26) | 25.21 | 0.841 | 27.47 | 0.919 | 26.81 | 0.843 | 27.35 | 0.913 |
| GenSIRR (ours) | 27.27 | 0.871 | 27.99 | 0.921 | 27.30 | 0.838 | 27.93 | 0.916 |
| GenSIRR + DEBS (k=4) | 27.58 | 0.881 | 28.08 | 0.937 | 27.34 | 0.840 | 28.03 | 0.931 |
GenSIRR outperforms convolutional, transformer-based, and prior diffusion-based reflection-removal systems on all public benchmarks in PSNR and on nearly all in SSIM, and Depth-Guided Early-Branching delivers a further boost without any retraining.
Human Evaluation
| Method | OpenRR-val Avg. Success | OpenRR-val Consensus | Nature Avg. Success | Nature Consensus | Real20 Avg. Success | Real20 Consensus | SIR2 Avg. Success | SIR2 Consensus |
|---|---|---|---|---|---|---|---|---|
| RDNet | 30.4% | 12.0% | 63.0% | 55.0% | 34.0% | 5.0% | 16.5% | 2.2% |
| DAI | 41.2% | 20.0% | 76.0% | 55.0% | 33.0% | 10.0% | 34.2% | 9.9% |
| GenSIRR (ours) | 96.6% | 87.0% | 96.0% | 90.0% | 91.0% | 80.0% | 78.5% | 56.4% |
Five evaluators inspected each output for objective success (no residual reflections, no removed content, no introduced artifacts); Consensus counts the samples judged successful by all five. GenSIRR dramatically improves both the average success rate and the unanimous-consensus rate over RDNet and DAI on every dataset.
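For reference, a minimal sketch of how the two columns could be computed from per-evaluator binary judgments (the paper's exact aggregation may differ):

```python
import numpy as np

def human_eval_metrics(votes):
    """votes: (num_samples, num_evaluators) boolean matrix of success judgments.
    Returns the average success rate and the unanimous-consensus rate."""
    votes = np.asarray(votes, dtype=bool)
    avg_success = votes.mean()               # mean over all samples and evaluators
    consensus = votes.all(axis=1).mean()     # fraction judged successful by everyone
    return avg_success, consensus
```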
BibTeX
@article{Li2025GenSIRR,
title={Rectifying Latent Space for Generative Single-Image Reflection Removal},
author={Li, Mingjia and Hu, Jin and Wang, Hainuo and Hu, Qiming and Wang, Jiarui and Guo, Xiaojie},
journal={arXiv preprint arXiv:2507.XXXX},
year={2025},
note={Under review},
url={https://research.mingjia.li/gensirr}
}