Panwang Pan | 潘攀望

Hi, I’m Panwang Pan, a Senior Researcher working at the intersection of Generative AI and multimodal learning.

Previously, I was a Senior Algorithm Engineer at Alibaba Cloud, bridging research and production by deploying models to complex, real‑world systems—from embedded devices to cloud platforms. I led the algorithm deployment for the Aliyun AI‑Box.

I received my M.S. in 2019 from Xiamen University (School of Informatics).

Email  /  Google Scholar  /  GitHub  /  Twitter  /  WeChat

📢 News

[2026-02] Five papers were accepted to CVPR 2026. One paper was accepted to ICLR 2026.

[2025-09] Six papers, including one oral presentation, were accepted to NeurIPS 2025.

[2025-06] One paper was accepted to ICCV 2025, and we released PartCrafter, a 3D-native diffusion transformer that generates 3D objects part by part.

[2025-02] One paper was accepted to CVPR 2025.

[2025-01] Three papers, including one Spotlight paper, were accepted to ICLR 2025.

📑 Selected Publications (Google Scholar)

Research Overview

My recent work is organized into two directions: Multimodal Generation and VLM Multimodal Understanding. In generation, I prioritize scene generation and world models, then extend to video and 3D content creation. In understanding, I focus on Jarvis-style systems, agentic workflows, and multimodal reasoning for perception and decision-making.

1. Multimodal Generation

World models are the central theme of this direction; work below is presented in the order of video generation, scene generation and world models, and 3D content generation.

Representative topics: 4D scenes, dynamic worlds, controllable video generation, meshes, Gaussian splats, and semantic layouts.

2. VLM Multimodal Understanding

Jarvis series and related agent systems where VLMs interpret instructions, coordinate tools, and improve downstream perception.

Representative topics: image restoration agents, photo retouching agents, multimodal planning, and perception-oriented VLM pipelines.

Multimodal Generation

Video Generation

Controllable video synthesis with compositional objectives and language-grounded reward signals.

CVPR 2026

ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation

Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Hengyu Liu, Tingting Shen, Yadong Mu

[Paper] [Project] [Code]

ID-Crafter introduces a VLM-grounded online RL framework that uses language feedback to improve compositional multi-subject video generation.

Scene Generation and World Models

Dynamic scene generation, 4D world modeling, and motion-aware representations for controllable environments.

CVPR 2026

Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan, Chenguo Lin*, Jingjing Zhao, Chenxin Li, Yuchen Lin, Kairun Wen, Yunlong Lin, Yixuan Yuan, Yadong Mu, Zhiwen Fan

[Paper] [Project] [Code]

Diff4Splat is a generalizable framework for controllable 4D scene generation from a single image using a video diffusion model.

ICLR 2025 Spotlight

4K4DGen: Panoramic 4D Generation at 4K Resolution

Panwang Pan*‡, Renjie Li*, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhengzhong Tu, Zhiwen Fan

[OpenReview] [Paper] [Project] [Code]

4K4DGen achieves high-quality panorama-to-4D generation at 4K resolution for the first time, using efficient splatting techniques for real-time exploration.

NeurIPS 2025

DynamicVerse: Physically-Aware Multimodal Modeling for Dynamic 4D Worlds

Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan

[Paper] [Project] [Code]

DynamicVerse is a physical-scale, multimodal 4D modeling framework for real-world videos.

CVPR 2026

MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

Chenguo Lin*, Yuchen Lin*, Panwang Pan†, Yifan Yu, Honglei Yan, Katerina Fragkiadaki, Yadong Mu

[Paper] [Project] [Code]

MoVieS is a feed-forward framework that jointly reconstructs appearance, geometry, and motion for 4D scene perception from monocular videos.

3D Content Generation

Structured 3D generation spanning meshes, Gaussian splats, humans, and semantic layouts.

NeurIPS 2025

PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

[Paper] [Project] [Code]

PartCrafter is a structured 3D generative model that jointly generates multiple parts and objects from a single RGB image in a single pass.

ICLR 2025

DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splatting Generation

[OpenReview] [Paper] [Project] [Code]

DiffSplat is a 3D generative framework that natively generates 3D Gaussians by repurposing large-scale image diffusion models.

NeurIPS 2024

HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors

[OpenReview] [Paper] [Project] [Code]

HumanSplat predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner.

VLM Multimodal Understanding

This track covers Jarvis-style systems, agentic workflows, and multimodal understanding modules where VLMs interpret instructions, coordinate tools, and improve downstream perception and decision-making.

Jarvis Series and Agentic Understanding

Representative VLM systems for multimodal planning, restoration, and creative interaction.

NeurIPS 2025

JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding‡, Wenbo Li, Shuicheng Yan

[Paper] [Project] [Code]

JarvisArt shows how an agentic VLM can plan and execute photo retouching while preserving content fidelity and faithfully following instructions.

CVPR 2025

JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration

[Paper] [Project] [Code]

JarvisIR is a VLM-powered system that dynamically schedules expert restoration models to improve downstream perception.

TPAMI 2025

InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior

[Paper] [Project] [Code]

InstructLayout integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 2D and 3D layout synthesis.

💼 Experience

PICO, ByteDance — Beijing, China — Senior Algorithm Engineer
Mentored by Cheng Chen, Zeming Li, and Honglei Yan.
09/2022 - Present
Alibaba Cloud — Hangzhou, China — Senior Computer Vision Algorithm Engineer
07/2019 - 07/2022
DevTech Compute, NVIDIA — Beijing, China — AI Developer Technology Engineer Intern
Advised by Xipeng Li.
07/2018 - 10/2018
🏆 Selected Awards

2024, 2023: ByteStyle Innovation Breakthrough Award (ByteDance)

2019: Outstanding Graduate of Xiamen University

2018: National Scholarship for Postgraduates, Ministry of Education (China’s highest scholarship honor)

2018: First Prize of GEDC; Second Prize of MCM & CPIPC

2017: Zhongxian Huang Scholarship, Xiamen University (≈10 awards per year)

2015: National Scholarship for Undergraduates (China’s highest scholarship honor)

💬 Miscellaneous

Conference Reviewer: NeurIPS, ICLR, CVPR, ICML, ICCV, ACM MM