We introduce EscherNet, a multi-view conditioned diffusion model for generative view synthesis. EscherNet offers exceptional generality, flexibility, and scalability within its design — it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, conditioned on any number of reference views with any camera poses.
CVPR 2024 / Oral
Just drop in a few images with poses (estimated by Dust3R or any SLAM system), and EscherNet can generate any number of novel views in 6DoF!
EscherNet is a multi-view diffusion model for scalable, generative, any-to-any novel view synthesis.
We design EscherNet following two key principles:
1. It builds upon an existing 2D diffusion model, inheriting its strong web-scale prior from large-scale training.
2. It encodes a camera pose for each view/image, similar to how language models encode a position for each token.
As a result, our model naturally handles an arbitrary number of views for any-to-any view synthesis.
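As a reference point for this analogy, the snippet below is a minimal PyTorch sketch of rotary positional embedding (RoPE) for 1D token positions, the language-side mechanism the second principle refers to. It is illustrative only; the function name, shapes, and base frequency are our own choices, not code from the EscherNet release.

import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Rotate channel pairs of x by angles proportional to the token position, so the
    # dot product between a rotated query and a rotated key depends only on their
    # relative position (i - j).
    # x: (seq, d) query or key features with d even; positions: (seq,) integer positions.
    seq, d = x.shape
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,) frequencies
    angles = positions[:, None].float() * freqs[None, :]                # (seq, d/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                     # pair up channels
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                                          # interleave back to (seq, d)

q = rope(torch.randn(10, 64), torch.arange(10))                         # rotated features for 10 tokens

CaPE replaces the scalar token position above with the camera pose of each view, as described next.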
The key design component of EscherNet is Camera Positional Encoding (CaPE), which encodes camera poses efficiently and accurately for image tokens within a transformer architecture. The idea is inspired by the language domain, where each word is a token and its position is encoded by a positional encoding, e.g. rotary positional embedding (RoPE). We treat each 2D image view as a token and its camera pose as the token position, encoded by CaPE. Novel view synthesis then becomes an image-to-image sequence prediction problem: conditioned on the input posed images, the model predicts novel views at the queried 3D poses. This design lets EscherNet, trained on 3-to-3 views, generalise to 100-to-100 views.
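To make this concrete, here is a minimal, hedged sketch of a RoPE-style camera positional encoding in PyTorch: key features are transformed block-wise by the 4x4 camera pose and query features by its inverse transpose, so the attention logit depends only on the relative pose between the two views. This is our own illustrative construction of the idea, not the official CaPE implementation; the function name, block size, and pose convention are assumptions.

import torch

def pose_encode(x: torch.Tensor, pose: torch.Tensor, is_query: bool) -> torch.Tensor:
    # Apply a camera-pose-dependent linear map to an image-token feature.
    # x: (..., d) feature with d divisible by 4; pose: (..., 4, 4) homogeneous camera transform.
    *batch, d = x.shape
    blocks = x.reshape(*batch, d // 4, 4)                        # split channels into 4-vectors
    m = torch.linalg.inv(pose).transpose(-1, -2) if is_query else pose
    out = torch.einsum('...ij,...nj->...ni', m, blocks)          # transform each 4-block by the pose
    return out.reshape(*batch, d)

# Relative-pose property: for a query q from view i and a key k from view j,
#   pose_encode(q, P_i, True) . pose_encode(k, P_j, False) = sum over blocks of q_b^T (P_i^{-1} P_j) k_b,
# i.e. the attention logit sees only the relative camera transform, never the absolute poses.
q, k = torch.randn(64), torch.randn(64)       # hypothetical image-token features
P_i, P_j = torch.eye(4), torch.eye(4)         # camera poses of the two views (identity here)
logit = pose_encode(q, P_i, True) @ pose_encode(k, P_j, False)

In this construction, the encoding acts on queries and keys inside attention (as in RoPE) rather than being added to the input, which is what allows any number of views at arbitrary poses to be mixed in one token sequence.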
We present EscherNet as a generative novel view synthesis method that synthesises novel views using large-scale priors. Compared to 3D diffusion models (e.g., Zero-1-to-3), EscherNet is more flexible in the number of views and offers better 3D consistency. Compared to scene-specific neural rendering methods (e.g., NeRFs), EscherNet is zero-shot and scene-agnostic. EscherNet can also be applied to real-world object-centric captures.
We show that EscherNet significantly outperforms Zero-1-to-3-XL, despite the latter being trained on roughly 10x more training data. Notably, many other 3D diffusion models can only predict fixed target views or condition on a single reference view, whereas EscherNet generates multiple consistent target views jointly from a flexible number of reference views.
Compared to scene-specific neural rendering methods, e.g., InstantNGP and 3D Gaussian Splatting, EscherNet offers plausible view synthesis on out-of-distribution scenes in a zero-shot manner, with superior rendering quality when only a few reference views are available. Although EscherNet improves as the number of reference views increases, it eventually starts to lag behind the neural rendering methods.
EscherNet can generate plausible novel views on real-world objects, conditioned on 5 reference views captured by a single camera mounted on a robot arm.
EscherNet's ability to generate dense and consistent novel views significantly improves the reconstruction of complete and well-constrained 3D geometry using NeuS.
Text-to-3D generation is achieved by conditioning EscherNet on reference views produced by off-the-shelf text-to-image generative models, e.g., MVDream and SDXL.
If you find this work useful in your own research, please consider citing the following.
@article{kong2024eschernet,
  title={EscherNet: A Generative Model for Scalable View Synthesis},
  author={Kong, Xin and Liu, Shikun and Lyu, Xiaoyang and Taher, Marwan and Qi, Xiaojuan and Davison, Andrew J},
  journal={arXiv preprint arXiv:2402.03908},
  year={2024}
}
The initial stages of this project yielded some incorrect outcomes; however, we find the emerging patterns aesthetically pleasing and worthy of sharing.