Panwang Pan | 潘攀望
Hi, I’m Panwang Pan, a Senior Researcher working at the intersection of Generative AI and multimodal learning at PICO, ByteDance.
Previously, I was a Senior Algorithm Engineer at Alibaba Cloud, bridging research and production by deploying models to complex, real‑world systems—from embedded devices to cloud platforms. I led the algorithm deployment for the Aliyun AI‑Box.
I received my M.S. in 2019 from Xiamen University (School of Informatics).
Email / Google Scholar / Github / Twitter / WeChat
📢 News
[2026-02] Five papers were accepted to CVPR 2026. One paper was accepted to ICLR 2026.
[2025-09] Six papers, including one oral presentation, were accepted to NeurIPS 2025.
[2025-06] One paper was accepted to ICCV 2025, and we released PartCrafter, a structured mesh generation transformer that generates objects part by part.
[2025-02] One paper was accepted to CVPR 2025.
[2025-01] Three papers, including one Spotlight paper, were accepted to ICLR 2025.
Research Overview
My recent work is organized into two directions: Multimodal Generation and Multimodal Understanding / Agent. In generation, I prioritize world models (scene generation) and video generation. In understanding, I focus on Jarvis-style systems, agentic workflows, and multimodal reasoning for perception and decision-making.
1. Multimodal Generation
World models are my primary focus; the work below is presented in two groups, video generation first, then world models (scene generation).
Representative topics: dynamic scenes/worlds and controllable video generation.
2. Multimodal Understanding / Agent
Jarvis series and related agent systems where VLMs interpret instructions, coordinate tools, and improve downstream perception.
Representative topics: photo retouching agents, multimodal planning, and perception-oriented VLM pipelines.
Multimodal Generation
Video Generation
Controllable video synthesis with compositional objectives and language-grounded reward signals.
Scene Generation and World Models
Dynamic scene generation, dynamic scene modeling (world models), and motion-aware representations for controllable environments.
DynamicVerse: Physically-Aware Multimodal Modeling for Dynamic Scene Worlds
Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan
[Paper]
[Project]
[Code]
DynamicVerse is a physical-scale, multimodal dynamic scene modeling framework for real-world videos.
Multimodal Understanding / Agent
This track covers Jarvis-style systems, agentic workflows, and multimodal understanding modules where VLMs interpret instructions, coordinate tools, and improve downstream perception and decision-making.
NeurIPS 2025
JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding‡, Wenbo Li, Shuicheng Yan‡
[Paper]
[Project]
[Code]
JarvisArt shows how an agentic VLM can plan and execute photo retouching while preserving content fidelity and faithfully following instructions.
🏆 Selected Awards
2024, 2023: ByteStyle Innovation Breakthrough Award (ByteDance)
2019: Outstanding Graduate of Xiamen University
2018: National Scholarship for Postgraduates, Ministry of Education (China’s highest scholarship honor)
2018: First Prize of GEDC; Second Prize of MCM & CPIPC
2017: Zhongxian Huang Scholarship, Xiamen University (≈10 awards per year)
2015: National Scholarship for Undergraduates (China’s highest scholarship honor)
💬 Miscellaneous
Conference Reviewer: NeurIPS, ICLR, CVPR, ICML, ICCV, ACM MM