Panwang Pan

Panwang Pan | 潘攀望

Hi, I’m Panwang Pan, a Senior Researcher at ByteDance working at the intersection of generative AI and multimodal learning.

At ByteDance, I turn research into large-scale, production-ready systems built on generative and multimodal models.

Email / Google Scholar / GitHub / Twitter / WeChat

💼 Experience

ByteDance — Beijing, China - Full-time Employee.	09/2022 - Present
Alibaba Cloud — Hangzhou, China - Full-time Employee.	07/2019 - 07/2022
NVIDIA Developer Technology — Beijing, China - Intern.	07/2018 - 10/2018

📢 News

[2026-02] Five papers accepted to CVPR 2026; one paper accepted to ICLR 2026.

[2025-09] Six papers accepted to NeurIPS 2025, including one oral presentation.

[2025-06] Released PartCrafter, a structured mesh-generation transformer that synthesizes objects part by part.

[2025-01] Three papers accepted to ICLR 2025, including one spotlight.

📑 Selected Publications (Google Scholar)

Research Overview

My recent work spans two themes: multimodal generation and multimodal understanding / agents. On the generation side, I emphasize world models (scene generation) and video generation. On the understanding side, I focus on Jarvis-style systems, agentic workflows, and multimodal reasoning for perception and decision-making.

1. Multimodal Generation

World models are central to this theme. The papers below are listed under video generation first, then under world models (scene generation).

Representative topics: dynamic scenes/worlds and controllable video generation.

2. VLM Multimodal Understanding / Agent

Jarvis series and related agent systems where VLMs interpret instructions, coordinate tools, and improve downstream perception.

Representative topics: photo retouching agents, multimodal planning, and perception-oriented VLM pipelines.

Multimodal Generation

Video Generation

Controllable video synthesis via compositional objectives and language-grounded reward signals.

CVPR 2026

ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation

Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Hengyu Liu, Tingting Shen, Yadong Mu

[Paper] [Project]

ID-Crafter introduces a VLM-grounded online RL framework that uses language feedback to improve compositional multi-subject video generation.

Ring Forcing: Towards Precise Long-Term Memory for Autoregressive Video Diffusion

Bowen Xue, Brandon Y. Feng, Chenguo Lin, Yuchen Lin, Yujia Zeng, Lvmin Zhang, Maneesh Agrawala, Honglei Yan, Panwang Pan†

[Paper] [Project]

We present Ring Forcing, an autoregressive video diffusion framework designed to robustly construct and precisely utilize long-term memory.

Scene Generation and World Models

Dynamic scene generation and modeling (world models), plus motion-aware representations for controllable environments.

CVPR 2026

Diff4Splat: Controllable Dynamic Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan, Chenguo Lin*, Jingjing Zhao, Chenxin Li, Yuchen Lin, Kairun Wen, Yunlong Lin, Yixuan Yuan, Yadong Mu, Zhiwen Fan

[Paper] [Project] [Code]

Diff4Splat is a generalizable framework for controllable dynamic scene generation from a single image using a video diffusion model.

CVPR 2026

MoVieS: Motion-Aware Dynamic View Synthesis in One Second

Chenguo Lin*, Yuchen Lin*, Panwang Pan† (Project Lead), Yifan Yu, Honglei Yan, Katerina Fragkiadaki, Yadong Mu

[Paper] [Project] [Code]

MoVieS is a feed-forward framework that jointly reconstructs appearance, geometry, and motion for dynamic scene perception from monocular videos.

NeurIPS 2025

DynamicVerse: Physically-Aware Multimodal Modeling for Dynamic Scene Worlds

Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan

[Paper] [Project] [Code]

DynamicVerse is a physically grounded, multimodal framework for modeling dynamic scenes from real-world video.

NeurIPS 2025

PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

Yuchen Lin, Chenguo Lin, Panwang Pan† (Project Lead), Honglei Yan, Yiqiang Feng, Yadong Mu, Katerina Fragkiadaki

[Paper] [Project] [Code]

PartCrafter is a structured 3D generative model that jointly generates multiple parts and objects from a single RGB image in a single pass.

Multimodal Understanding / Agent

This section covers Jarvis-style systems, agentic workflows, and multimodal understanding modules in which VLMs interpret instructions, coordinate tools, and improve downstream perception and decision-making.

TVCG 2026

ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies

Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma

[Paper] [Project]

NeurIPS 2025

JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding‡, Wenbo Li, Shuicheng Yan‡

[Paper] [Project] [Code]

JarvisArt demonstrates how an agentic VLM can plan and execute photo retouching while preserving content fidelity and following complex instructions.

CVPR 2025

JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration

Yunlong Lin*, Zixu Lin*, Haoyu Chen*, Panwang Pan*, Chenxin Li, Sixiang Chen, Kairun Wen, Yeying Jin, Wenbo Li, Xinghao Ding‡

[Paper] [Project] [Code]

JarvisIR is an agentic VLM-powered system that plans and schedules expert restoration models to improve downstream perception.

ICLR 2026

SAM-Veteran: An MLLM-Based Human-like SAM Agent for Reasoning Segmentation

Tianyuan Du, Haopeng Li, Zhen Fan, Jiarui Zhang, Panwang Pan†, Yang Zhang

[Paper] [Project]

SAM-Veteran is a mask-aware SAM agent that emulates human-like interaction with SAM through a reasoning-driven segmentation workflow.

💬 Miscellaneous

Conference reviewing: NeurIPS, ICLR, CVPR, ICML, ICCV, ACM MM

Co-created with OpenClaw and Codex.