We introduce EscherNet, a multi-view conditioned diffusion model for generative view synthesis. EscherNet offers exceptional generality, flexibility, and scalability within its design — it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, conditioned on any number of reference views with any camera poses.
CVPR 2024 / Oral
EscherNet is a multi-view diffusion model for scalable, generative any-to-any novel view synthesis.
We design EscherNet following two key principles: (1) it builds on an existing 2D diffusion model, inheriting the strong web-scale prior from its large-scale training; and (2) it encodes a camera pose for each view/image, much as language models encode a position for each token. The model can therefore naturally handle an arbitrary number of views for any-to-any view synthesis.
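To make the pose-as-position analogy concrete, here is a minimal NumPy sketch of one way to fold camera poses into attention, in the spirit of a rotary-style relative encoding: query and key features are transformed by their views' 4×4 pose matrices so the attention score depends only on the *relative* pose between the two views. This is an illustrative assumption, not the paper's exact formulation; the function names and block-wise layout are hypothetical.

```python
import numpy as np

def encode_query(feat, pose):
    # feat: (d,) with d divisible by 4; pose: (4, 4) homogeneous camera matrix.
    # Split the feature into 4-vectors and transform each block by the pose.
    return (feat.reshape(-1, 4) @ pose).reshape(-1)

def encode_key(feat, pose):
    # Keys use the inverse-transpose, so q_i . k_j reduces to
    # q (P_i P_j^{-1}) k^T, i.e. it depends only on the relative pose.
    return (feat.reshape(-1, 4) @ np.linalg.inv(pose).T).reshape(-1)

def random_pose(rng):
    # Random rotation (via QR) plus translation as a 4x4 homogeneous matrix.
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    P = np.eye(4)
    P[:3, :3], P[:3, 3] = R, rng.normal(size=3)
    return P

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
Pi, Pj, G = random_pose(rng), random_pose(rng), random_pose(rng)

s1 = encode_query(q, Pi) @ encode_key(k, Pj)
# Re-expressing both cameras in a different world frame (right-multiplying
# both poses by a common transform G) leaves the attention score unchanged:
s2 = encode_query(q, Pi @ G) @ encode_key(k, Pj @ G)
print(np.isclose(s1, s2))  # True
```

The invariance check at the end is the property that matters for any-to-any synthesis: only relative camera geometry between views influences attention, so no global reference frame needs to be fixed.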
EscherNet is a generative novel view synthesis method that synthesises novel views with large-scale priors. Compared to 3D diffusion models (e.g., Zero-1-to-3), EscherNet is more flexible in the number of views and more 3D-consistent. Compared to scene-specific neural rendering methods (e.g., NeRFs), EscherNet is zero-shot and scene-agnostic. EscherNet can also be applied to real-world object-centric captures.
We show that EscherNet significantly outperforms Zero-1-to-3-XL, despite the latter being trained on 10× more data. Notably, many other 3D diffusion models can only predict fixed target views or condition on only a single reference view, while EscherNet generates multiple consistent target views jointly from a flexible number of reference views.
Compared to scene-specific neural rendering methods such as Instant-NGP and 3D Gaussian Splatting, EscherNet offers plausible view synthesis on out-of-distribution scenes in a zero-shot manner, with superior rendering quality given few reference views. Although EscherNet's quality improves as more reference views are added, it eventually falls behind these scene-specific neural rendering methods.
EscherNet can generate plausible novel views on real-world objects, conditioned on 5 reference views captured by a single camera mounted on a robot arm.
EscherNet's ability to generate dense and consistent novel views significantly improves the reconstruction of complete and well-constrained 3D geometry using NeuS.
Text-to-3D generation is achieved by conditioning EscherNet's input views on the output of off-the-shelf text-to-image generative models, e.g., MVDream and SDXL.
If you find this work useful in your own research, please consider citing:
@article{kong2024eschernet,
title={EscherNet: A Generative Model for Scalable View Synthesis},
author={Kong, Xin and Liu, Shikun and Lyu, Xiaoyang and Taher, Marwan and Qi, Xiaojuan and Davison, Andrew J},
journal={arXiv preprint arXiv:2402.03908},
year={2024}
}
The initial stages of this project produced some incorrect outputs; we nonetheless find the emerging patterns aesthetically pleasing and worth sharing.