Xin Kong (孔昕)

I'm currently a Research Scientist at NVIDIA Cosmos Lab, working on foundation models for Physical AI / Robotics.

I obtained my PhD (Thesis: "Scalable 3D Generative Reconstruction for Spatial AI") from the Dyson Robotics Lab at Imperial College London, supervised by Prof. Andrew J. Davison. I received my Master's degree (Thesis: "Deep Point Cloud Semantic Segmentation and its Application in Robotics") from Zhejiang University, supervised by Prof. Yong Liu, and my B.Eng in Control Science and Engineering from Harbin Institute of Technology.

I was a research intern at Meta with Ethan Weber and Peter Kontschieder, and at Google Zurich with Federico Tombari and Daniel Watson (DeepMind). During my undergraduate studies, I was a member of the computer vision group of the Harbin Institute of Technology Competition Robotics Team (HITCRT).

Email / Google Scholar / Github / LinkedIn / Twitter

Research

I'm working on World Models for Physical AI, exploring scalable recipes for robotics.

Publications

Cosmos 3: Omnimodal World Models for Physical AI
NVIDIA. Contributed to Action Modality.
arXiv preprint, 2026.
paper / project / code

Cosmos 3 is an omnimodal world model for Physical AI that unifies understanding, generation, simulation, and action across language, images, video, audio, and robot actions in a single architecture.

MLP Splatting: Object-Centric Neural Fields
Shinjeong Kim, Yuzhou Cheng, Xin Kong, Paul H. J. Kelly, Andrew J. Davison
arXiv preprint, 2026.
paper / project

MLP-Splatting decomposes scenes into a few object-centric light-field primitives, each an independent compact MLP with localized spatial support, enabling photorealistic novel-view synthesis and interactive object-level editing from RGB supervision alone.

KV-Tracker: Real-Time Pose Tracking with Transformers
Marwan Taher, Ignacio Alzugaray, Kirill Mazur, Xin Kong, Andrew J. Davison
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2026.
paper / project / video

KV-Tracker caches key-value pairs from multi-view geometry transformers to enable real-time 6-DoF pose tracking and online scene reconstruction from monocular RGB, achieving up to 15× speedup and ~27 FPS without drift or catastrophic forgetting.

CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis
Xin Kong, Daniel Watson, Yannick Strümpler, Michael Niemeyer, Federico Tombari
arXiv preprint, 2025.
paper

CausNVS is an autoregressive diffusion model for next novel view synthesis with relative pose encoded attention (CaPE) and efficient KV cache inference, towards real-time world modelling, AR streaming and interactive online generation.

EscherNet: A Generative Model for Scalable View Synthesis
Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, Andrew J. Davison
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024. Seattle, WA, USA.
Oral (0.78%)
paper / project / code / video / demo

EscherNet is a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with the camera positional encoding (CaPE), allowing continuous relative camera control between an arbitrary number of reference and target views.

vMAP: Vectorised Object Mapping for Neural Field SLAM
Xin Kong, Shikun Liu, Marwan Taher, Andrew J. Davison
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023. Vancouver, Canada.
paper / project / video / code

We present vMAP, an object-level real-time mapping system in which each object is represented by a separate MLP neural field model and all object models are optimised in parallel via vectorised training.

Efficient Pedestrian Following by Quadruped Robots
Guangyao Zhai, Zhen Zhang, Xin Kong, Yong Liu
IEEE International Conference on Robotics and Automation (ICRA), Workshop on Legged Robots, 2021. Xi'an, China.
Best Extended Abstract Award Finalist
paper / video / certificate

We use a quadruped robot to complete a pedestrian-following task in challenging scenarios. The system consists of two modules, perception and planning, and relies on the onboard sensors.

SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure
Lin Li, Xin Kong, Xiangrui Zhao, Yong Liu
IEEE International Conference on Robotics and Automation (ICRA), 2021. Xi'an, China.
paper / video

We present a novel semantic-aided LiDAR SLAM with loop closure based on LOAM, named SA-LOAM, which leverages semantics in odometry as well as loop closure detection.

HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation
Xiaoyang Lyu, Liang Liu, Mengmeng Wang, Xin Kong, et al.
The 35th AAAI Conference on Artificial Intelligence (AAAI), 2021. Virtual.
paper / code

Based on theoretical and empirical evidence, we present HR-Depth for high-resolution self-supervised monocular depth estimation.

Semantic Graph Based Place Recognition for 3D Point Clouds
Xin Kong, Xuemeng Yang, Guangyao Zhai, Xiangrui Zhao, Yong Liu, et al.
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020. Las Vegas, USA.
paper / code / video / presentation

We propose a novel semantic graph based approach for large-scale place recognition in 3D point clouds. A novel semantic graph representation and a fast and effective graph similarity network are presented.

PASS3D: Precise and Accelerated Semantic Segmentation for 3D Point Cloud
Xin Kong, Guangyao Zhai, Baoquan Zhong, Yong Liu
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019. Macau, China.
paper / video

We propose a framework to achieve point-wise semantic segmentation for 3D LiDAR point clouds.

Competitions & Projects

Zero123-hf: a diffusers implementation of zero123
Xin Kong
code

A Hugging Face Diffusers (merged) implementation of the original Zero-1-to-3. Zero-1-to-3 is a large-scale diffusion model that can control the camera perspective, enabling zero-shot novel view synthesis and 3D reconstruction from a single image.

Awesome Point Cloud Place Recognition
Xin Kong, Lin Li
code

A list of papers about point cloud based place recognition, also known as loop closure detection in SLAM.

ICRA 2018 DJI RoboMaster AI Challenge
Team: I Hiter. Xingguang Zhong, Xin Kong, Xiaoyang Lyu, Le Qi, Hao Huang, Linrui Tian, Songwei Li
IEEE International Conference on Robotics and Automation (ICRA), 2018. Brisbane, Australia.
Global Champion / Ranking: 1st/21 / Certificate / Video / Rules

Our team built two fully automatic robots, including their mechanics, circuits, control, and algorithms. I was responsible for the robots' visual servoing, localization, navigation, and decision-making.

2017 & 2018 RoboMaster Robotics Competition
Team: I Hiter. Wei Chen, Yufei Liu, Xin Kong, Xiaoyang Lyu, et al.
China University Robot Competition (全国大学生机器人大赛), 2017 & 2018. Shenzhen, China.
First Prize / Ranking: 4th/200+ / Certificate / Highlights

Our team built more than 10 complex automatic or semi-automatic robots. I was responsible for visual servoing, which involved computer vision, RGB-D camera calibration, machine learning, multithreaded programming, ballistic modeling, etc.

2017 The Mathematical Contest in Modeling (MCM)
Shengqi Li, Xin Kong, Shuaishuai Liu
The Consortium for Mathematics and Its Applications (COMAP), 2017. Online.
Meritorious Winner (Top 10%) / Paper / Problems

Our team formulated the practical problem (Managing the Zambezi River) posed by COMAP as a mathematical model. Through background research, reasonable assumptions and optimization analysis, a solution to the problem was obtained.

2016 The Contemporary Undergraduate Mathematical Contest in Modeling (CUMCM)
Shengqi Li, Xin Kong, Shuaishuai Liu
China Society for Industrial and Applied Mathematics (CSIAM), 2016. Online.
National Second Prize / Paper / Problems

Our team formulated the practical problem (Mooring System Design) posed by CSIAM as a mathematical model. Through background research, reasonable assumptions and optimization analysis, a solution to the problem was obtained.

2016 The ABU Asia-Pacific Robot Contest (ABU Robocon)
Team: HITCRT. Jingyang Wu, Kuan Xu, Xin Kong, et al.
Asia-Pacific Broadcasting Union, 2016. Zoucheng, China.
National First Prize / Certificate

I was an echelon member of the vision group, helping the official team members with Ubuntu environment setup, camera calibration, and computer vision algorithm testing. Thanks to my seniors for their careful guidance!

Automatic Dustbin Robot based on Kinect v2
Team: HITCRT. Xingguang Zhong, Xin Kong, Chen Yao, Yide Liu, etc.
National Innovation Training Program, 2016. Harbin, China.
Bronze Prize of University Zuguang Cup

Our team designed an automatic dustbin robot that can catch objects. I was in charge of Kinect development, RGB-D camera calibration, moving object tracking, and trajectory prediction.

Book Sterilizer based on Automatic Page Turning Device
Xin Kong, Dai Gao, Yiqiu Ding, Jiaming Cui, Jingda Du
College Training Program, 2015. Harbin, China.
National Invention Patent / University-level First Prize

Our team designed and implemented an automatic book sterilizer that protects books by removing bacteria and dust from them. Patent No. ZL 2015103334672.

Honors

May 2021, Sun Youxian (Academician of the Chinese Academy of Engineering) Scholarship.

Nov. 2018, Academic Scholarship - Zhejiang University.

May 2018, Outstanding Graduate - Harbin Institute of Technology.

May 2018, 3rd Prize of Innovation Scholarship - Ministry of Industry and Information Technology.

Nov. 2016, 8841 Impact Scholarship - Harbin Institute of Technology.

About Me

Skills: PyTorch/TensorFlow/JAX, TPU/GPU Training, Python/C++, Linux, ROS, OpenCV/PCL, MATLAB

Languages: Chinese: Native. English: Professional Proficiency.

「Talk is cheap. Show me the code.」

Last update: 2024.02.06. Thanks.