Boosting Monocular Metric Depth Estimation via Bokeh Rendering
Abstract
Bokeh rendering and depth estimation share a fundamental optical connection, yet existing methods fail to fully exploit this reciprocity. Conventional bokeh pipelines rely heavily on noisy depth maps, which inevitably introduce visual artifacts. Conversely, existing monocular depth models typically follow one of two flawed paradigms: generative diffusion-based frameworks often lack consistent metric scale, while feed-forward metric depth models frequently fail in textureless or distant regions, precisely where defocus blur could provide geometric cues. We propose BokehDepth, a two-stage framework that treats synthetic defocus as a supervision-free geometric signal. In the first stage, a physically grounded generative model produces calibrated bokeh stacks from a single sharp image without requiring a precomputed depth map. In the second stage, a lightweight defocus-aware aggregation module injects these stacks into the encoder of a depth estimation framework, allowing the model to extract consistent geometric features from the defocus dimension while keeping the decoder architecture unchanged. Experiments demonstrate that BokehDepth achieves superior visual bokeh fidelity compared to depth-dependent rendering baselines and consistently improves the metric accuracy of state-of-the-art monocular depth models.
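
To make the second-stage design concrete, the sketch below shows one plausible form a defocus-aware aggregation module could take: per-pixel attention weights are computed over the focal (defocus) dimension of the bokeh-stack features and the weighted sum is injected residually into the sharp-image encoder features, leaving the downstream decoder untouched. This is a minimal illustration under assumed tensor shapes and layer choices; the class name `DefocusAwareAggregation` and all hyperparameters are hypothetical and not taken from the paper.

```python
# Illustrative sketch (not the authors' code) of a defocus-aware aggregation
# module: features from each rendered bokeh image are fused by per-pixel
# attention over the stack dimension and added residually to the sharp-image
# encoder features. Names and shapes are assumptions.
import torch
import torch.nn as nn


class DefocusAwareAggregation(nn.Module):
    """Fuse a stack of per-focal-plane feature maps into one feature map."""

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        # Scores each focal slice at every spatial location; a softmax over
        # the stack dimension turns the scores into aggregation weights.
        self.score = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )
        # Projects the aggregated defocus feature before the residual add,
        # so the pretrained encoder/decoder path is left unchanged.
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, sharp_feat: torch.Tensor, stack_feat: torch.Tensor) -> torch.Tensor:
        # sharp_feat: (B, C, H, W) encoder features of the original sharp image
        # stack_feat: (B, N, C, H, W) features of the N rendered bokeh images
        b, n, c, h, w = stack_feat.shape
        scores = self.score(stack_feat.reshape(b * n, c, h, w)).reshape(b, n, 1, h, w)
        weights = torch.softmax(scores, dim=1)           # per-pixel weights over N slices
        aggregated = (weights * stack_feat).sum(dim=1)   # (B, C, H, W)
        return sharp_feat + self.proj(aggregated)        # residual injection


if __name__ == "__main__":
    agg = DefocusAwareAggregation(channels=256)
    sharp = torch.randn(2, 256, 24, 32)        # sharp-image encoder features
    stack = torch.randn(2, 5, 256, 24, 32)     # features of a 5-image bokeh stack
    print(agg(sharp, stack).shape)             # torch.Size([2, 256, 24, 32])
```

Because the fused output has the same shape as the original encoder feature, such a module could be dropped between encoder stages of an existing metric depth model without modifying its decoder.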