RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
:::info Stub: Full Engineering Breakdown Coming
This paper was auto-fetched from arXiv on 2026-06-01. A full breakdown with production viability rating, implementation notes, and honest limitations is being written.
:::
| | |
| --- | --- |
| Authors | Ulrich Prestel et al. |
| Year | 2026 |
| Field | Computer Vision |
| arXiv | 2605.31535 |
| Categories | cs.CV, cs.AI, cs.LG |
Abstract
Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
Engineering Breakdown
The Problem
Video data is abundant, yet self-supervised novel view synthesis (NVS) has been hard to scale. The abstract names two causes: training on realistic videos is brittle, and system designs built from multiple networks scale in ways that are hard to predict.
The Approach
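RayDer is a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, which turns self-supervised NVS into a single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content so that training on unconstrained real-world video stays stable. The target task remains static-scene NVS: dynamic content serves purely as scalable supervision and is not reconstructed, unlike in dynamic-scene (4D) NVS.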
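To make the self-supervised setup concrete, here is a minimal sketch of held-out-frame supervision, a common way to train NVS from video without pose labels. The abstract does not spell out RayDer's objective, so everything here is an assumption for illustration: the helper names `split_context_target`, `photometric_mse`, and `stand_in_model` are hypothetical, frames are toy flattened pixel lists, and the "model" is a trivial per-pixel average standing in for the unified backbone.

```python
import random
from typing import List, Tuple

Frame = List[float]  # toy stand-in for a flattened image


def split_context_target(
    num_frames: int, num_target: int, seed: int
) -> Tuple[List[int], List[int]]:
    """Hold out `num_target` frame indices; the rest become context."""
    if not 0 < num_target < num_frames:
        raise ValueError("need at least one context and one target frame")
    rng = random.Random(seed)
    target = sorted(rng.sample(range(num_frames), num_target))
    held_out = set(target)
    context = [i for i in range(num_frames) if i not in held_out]
    return context, target


def photometric_mse(pred: Frame, truth: Frame) -> float:
    """Mean squared pixel error between a rendered and a real frame."""
    if len(pred) != len(truth):
        raise ValueError("frames must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def stand_in_model(context_frames: List[Frame]) -> Frame:
    """Hypothetical placeholder for the backbone: averages context frames.

    A real model would estimate cameras, build a scene representation,
    and render the held-out view; none of that is modelled here.
    """
    n = len(context_frames)
    return [sum(pixels) / n for pixels in zip(*context_frames)]


# Toy clip of four two-pixel frames (synthetic, not from the paper).
clip: List[Frame] = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
context_idx, target_idx = split_context_target(len(clip), num_target=1, seed=0)

prediction = stand_in_model([clip[i] for i in context_idx])
loss = photometric_mse(prediction, clip[target_idx[0]])
print(f"context={context_idx} target={target_idx} loss={loss:.4f}")
```

The point of the sketch is only the shape of the training signal: the held-out frame itself is the label, so no camera or depth annotation is required.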
Key Results
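Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, it achieves zero-shot open-set performance competitive with state-of-the-art supervised approaches.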
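The abstract reports power-law scaling with data and compute. As a minimal sketch of how such a claim is typically checked, the snippet below fits `loss = a * n ** (-b)` by ordinary least squares in log-log space. The function name `fit_power_law` is hypothetical and the data points are synthetic, generated from a known power law; they are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**(-b) via linear regression on (log x, log y).

    Returns (a, b). All inputs must be strictly positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    var_x = sum((u - mean_x) ** 2 for u in log_x)
    if var_x == 0.0:
        raise ValueError("x values must not all be identical")
    cov_xy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic example: losses generated from a = 2.0, b = 0.25 (not paper data).
data_sizes = [1e3, 1e4, 1e5, 1e6]
losses = [2.0 * n ** -0.25 for n in data_sizes]

a, b = fit_power_law(data_sizes, losses)
print(f"a={a:.3f}, b={b:.3f}")
```

On real measurements the points will not lie exactly on a line in log-log space, so the residuals of this fit are a quick way to judge how "clean" a reported scaling trend is.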
Research Areas
This paper contributes to the following areas of AI/ML engineering:
- Image recognition
- Object detection
- Visual transformers
- Convolutional networks
- Multimodal learning
- Self-supervised learning
