Rectifying Latent Space for Generative Single-Image Reflection Removal

Mingjia Li, Jin Hu, Hainuo Wang, Qiming Hu, Jiarui Wang, Xiaojie Guo
School of Software, Tianjin University, China
arXiv Submission (2025)

Correspondence to Xiaojie Guo (xj.max.guo@gmail.com)

GenSIRR removes strong reflections while preserving details

GenSIRR removes strong, view-dependent reflections across diverse scenes, restoring photorealistic backgrounds and unlocking downstream perception tasks.

Abstract

Single-image reflection removal is highly ill-posed: a captured photograph is a linear superposition of the background transmission and a distracting reflection layer. Existing methods struggle to reason about this composition, and consequently fail on challenging in-the-wild imagery. We reframe a powerful latent diffusion model to properly perceive and process such composite inputs.

GenSIRR introduces a reflection-equivariant VAE that aligns the latent space with the physics of layer formation, a learnable task-specific text embedding that bypasses ambiguous language guidance, and a depth-guided early-branching sampling strategy that selects the most faithful generation path using structural cues. Together, these components enable accurate, high-quality reflection removal, achieving new state-of-the-art performance across benchmarks and real-world photographs.
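The reflection-equivariant VAE can be illustrated with a toy objective. The sketch below is our own reading, not the paper's exact loss: we assume "aligning the latent space with the physics of layer formation" means that encoding a linear mixture of transmission and reflection should match the same linear mixture of their individual encodings. The function names, the blend weight `alpha`, and the plain MSE penalty are all illustrative assumptions.

```python
import numpy as np

def equivariance_loss(encode, transmission, reflection, alpha=0.8):
    # Illustrative reflection-equivariance penalty (an assumption, not the
    # paper's exact loss): the pixel-space superposition alpha*T + (1-alpha)*R
    # should map to the same superposition of the individual latents.
    composite = alpha * transmission + (1 - alpha) * reflection
    z_composite = encode(composite)
    z_mixed = alpha * encode(transmission) + (1 - alpha) * encode(reflection)
    return float(np.mean((z_composite - z_mixed) ** 2))
```

For any linear encoder this loss is exactly zero; nonlinear encoders are penalized in proportion to how much they distort the superposition, which is the latent geometry such an objective pushes toward.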

Method Overview

GenSIRR builds on FLUX.1 Kontext and is trained in two stages. Stage I fine-tunes a LoRA adapter over the VAE with the equivariance loss to stabilize the latent geometry. Stage II freezes the VAE and optimizes both the DiT backbone and the learnable task embedding (LTE) on a curated mix of real and synthetic reflections. At inference time, we optionally apply depth-guided early-branching sampling (DEBS) for sample selection.

  • Training Data: PD-12M for re-VAE, curated Real/Nature/RRW mixtures for DiT.
  • Optimization: AdamW (1e-4 for re-VAE, 1e-5 for DiT) with batch sizes 128 and 32, respectively.
  • Sampling: 28 denoising steps with optional DEBS branching at step 1.
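The hyper-parameters listed above can be collected into a single configuration; the key names below are our own, not the authors' actual config schema.

```python
# Illustrative two-stage training configuration assembled from the
# hyper-parameters listed above; key names are assumptions.
TRAIN_CONFIG = {
    "stage1_re_vae": {
        "trainable": "VAE LoRA adapter",
        "data": "PD-12M",
        "optimizer": "AdamW",
        "lr": 1e-4,
        "batch_size": 128,
    },
    "stage2_dit": {
        "trainable": ["DiT backbone", "learnable task embedding"],
        "data": "Real/Nature/RRW mixture",
        "optimizer": "AdamW",
        "lr": 1e-5,
        "batch_size": 32,
    },
    "sampling": {"num_steps": 28, "debs_branch_step": 1},
}
```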
Overview of GenSIRR training pipeline
Two-stage training pipeline with re-VAE alignment, LTE optimization, and optional test-time scaling.

Depth-Guided Early-Branching

Reflections manifest as inconsistent geometry. DEBS evaluates partially denoised candidates with a monocular depth estimator and continues the trajectory that exposes the deepest background.

  • Insight: If reflections vanish in the first denoising step, they stay suppressed throughout sampling.
  • Metric: Global average depth serves as a reference-free quality proxy.
  • Outcome: Deterministic results with significantly fewer retries and only milliseconds of additional compute.
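The selection rule above reduces to a few lines. This is a minimal sketch assuming DEBS scores candidates by global average depth; `estimate_depth` stands in for an off-the-shelf monocular depth model and is not a real API.

```python
import numpy as np

def debs_select(one_step_candidates, estimate_depth):
    # Depth-guided early-branching sketch: score each partially denoised
    # candidate with a monocular depth estimator and keep the candidate
    # whose predicted scene is deepest on average -- residual reflections
    # tend to flatten the depth map.
    scores = [float(np.mean(estimate_depth(c))) for c in one_step_candidates]
    best = int(np.argmax(scores))
    return best, scores
```

Only the chosen trajectory is denoised to completion, which is why the extra cost is a handful of depth-model forward passes rather than k full sampling runs.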
Figure: one-step candidates paired with their final selections for two scenes. One-step latents already reveal which trajectory will succeed.

Benchmark Results

Method                  Real20 (20)     SIR2 (454)      Nature (20)     Average
                        PSNR   SSIM     PSNR   SSIM     PSNR   SSIM     PSNR   SSIM
ERRNet (CVPR'19)        22.89  0.803    23.55  0.882    22.18  0.756    23.47  0.874
IBCLN (CVPR'20)         21.86  0.762    24.20  0.884    23.57  0.783    24.08  0.875
YTMT (NeurIPS'21)       23.26  0.806    24.08  0.890    23.85  0.810    24.04  0.883
Dong et al. (ICCV'21)   23.34  0.812    24.25  0.901    23.45  0.808    24.18  0.894
DSRNet (ICCV'23)        23.91  0.818    25.71  0.906    25.22  0.832    25.62  0.899
Zhu et al. (CVPR'24)    21.83  0.801    25.48  0.897    26.04  0.846    25.37  0.909
DSIT (NeurIPS'24)       25.22  0.836    26.43  0.911    26.77  0.847    26.40  0.905
RDNet (CVPR'25)         25.71  0.850    26.69  0.908    26.31  0.846    26.63  0.903
DAI (AAAI'26)           25.21  0.841    27.47  0.919    26.81  0.843    27.35  0.913
GenSIRR (ours)          27.27  0.871    27.99  0.921    27.30  0.838    27.93  0.916
GenSIRR + DEBS (k=4)    27.58  0.881    28.08  0.937    27.34  0.840    28.03  0.931

We outperform convolutional, transformer, and prior diffusion-based reflection removal systems across all public benchmarks, and Depth-Guided Early-Branching delivers an extra performance boost without retraining.
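The PSNR numbers above are the standard peak signal-to-noise ratio. For reference, a minimal implementation for images scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    # Peak signal-to-noise ratio in dB; higher is better.
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 per pixel gives an MSE of 0.01 and hence a PSNR of 20 dB.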

Human Evaluation

Method            OpenRR-val             Nature                 Real20                 SIR2
                  Avg. Succ.  Consensus  Avg. Succ.  Consensus  Avg. Succ.  Consensus  Avg. Succ.  Consensus
RDNet             30.4%       12.0%      63.0%       55.0%      34.0%       5.0%       16.5%       2.2%
DAI               41.2%       20.0%      76.0%       55.0%      33.0%       10.0%      34.2%       9.9%
GenSIRR (ours)    96.6%       87.0%      96.0%       90.0%      91.0%       80.0%      78.5%       56.4%

Five evaluators inspected each sample for objective success (no residual reflections, no deletions, no artifacts). GenSIRR dramatically improves both average success rates and unanimous agreement compared to RDNet and DAI across all datasets.
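Both columns can be reproduced from raw ballots. A small sketch, assuming each evaluator casts a binary success vote per image; the function name and data layout are our own:

```python
def aggregate_votes(votes):
    # votes: one list per image, each holding the evaluators' binary
    # success votes (1 = reflection removed with no artifacts).
    # Avg. success = fraction of positive votes over all ballots;
    # consensus = fraction of images where every evaluator voted success.
    total_ballots = sum(len(v) for v in votes)
    avg_success = sum(sum(v) for v in votes) / total_ballots
    consensus = sum(all(v) for v in votes) / len(votes)
    return avg_success, consensus
```

Consensus is the stricter statistic: one dissenting vote on an image removes it from the unanimous count, which is why the consensus columns trail the averages.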

BibTeX

@article{Li2025GenSIRR,
  title={Rectifying Latent Space for Generative Single-Image Reflection Removal},
  author={Li, Mingjia and Hu, Jin and Wang, Hainuo and Hu, Qiming and Wang, Jiarui and Guo, Xiaojie},
  journal={arXiv preprint arXiv:2507.XXXX},
  year={2025},
  note={Under review},
  url={https://research.mingjia.li/gensirr}
}