Siwei Han

UNC‑Chapel Hill · Fudan University

Hi! I'm Siwei Han (韩偲蔚), a first-year Ph.D. student in the Department of Computer Science at UNC‑Chapel Hill, advised by Prof. Huaxiu Yao. Before that, I received my Bachelor of Science degree from Fudan University.

My research focuses on enhancing multimodal and multi-agent systems built upon large foundation models. I develop methods that combine retrieval-augmented reasoning, cross-modal alignment, and collaborative agent interaction to improve models’ understanding, planning, and decision-making. My work aims to make LLMs and VLMs more reliable and capable in tasks such as document understanding and multimodal reasoning.

Email · Google Scholar · GitHub · LinkedIn · X (Twitter)

🔥 News

2025.10 We present Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails!

2025.09 Three papers are accepted by NeurIPS 2025, including one spotlight!

2025.09 One paper is accepted by EMNLP 2025 Main as an Oral!

2025.03 We present MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding!

2025.02 MMIE is selected to be presented as an Oral!

2025.01 MMIE is accepted by ICLR 2025!

📝 Selected Publications

Full Publications

†: Equal contribution

Preprint

Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao

We present the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available on GitHub.

Preprint

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao

We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and images to solve DocQA problems. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Our data and code are available on GitHub.

ICLR 2025 Oral

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Peng Xia†, Siwei Han†, Shi Qiu†, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao

In this paper, we introduce MMIE, a robust, knowledge-intensive benchmark to evaluate interleaved multimodal comprehension and generation in LVLMs. It contains 20K+ examples covering 12 fields and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. We publicly release our benchmark and code on MMIE.

ICML 2024

Generating Chain-of-Thoughts with a Direct Pairwise-Comparison Approach to Searching for the Most Promising Intermediate Thought

Zhen-Yu Zhang, Siwei Han, Huaxiu Yao, Gang Niu, Masashi Sugiyama

In this paper, we propose a novel comparison-based CoT generation algorithm that directly identifies the most promising thoughts from the LLM's noisy feedback. In each round, we randomly pair intermediate thoughts and directly prompt the LLM to select the more promising one from each pair, allowing us to identify the most promising thoughts through an iterative process. To further model the noise in the comparisons, we draw on ensemble and dueling-bandit techniques and propose two variants of the algorithm.
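As a rough illustration of the pairing idea only (not the paper's implementation), the sketch below runs a knockout over candidate thoughts. The function name `select_promising`, the `prefer` callback (a stand-in for prompting the LLM with a pair), and the `votes` majority vote (a very simple ensemble against noisy comparisons) are all assumptions made for this example. The dueling-bandit variant is not shown.

```python
import random
from typing import Callable, List, Optional


def select_promising(
    thoughts: List[str],
    prefer: Callable[[str, str], str],
    votes: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one thought by repeated random pairing and pairwise comparison.

    `prefer(a, b)` is a hypothetical stand-in for asking the LLM which of two
    intermediate thoughts is more promising; it must return `a` or `b` and
    may be noisy. Each pair is compared `votes` times and the majority wins
    (use an odd `votes` to avoid ties, which go to `b`).
    """
    if not thoughts:
        raise ValueError("need at least one thought")
    rng = rng or random.Random()
    pool = list(thoughts)
    while len(pool) > 1:
        rng.shuffle(pool)  # random pairing in every round
        survivors = []
        if len(pool) % 2 == 1:
            # An unpaired thought advances to the next round unchallenged.
            survivors.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            wins_a = sum(prefer(a, b) == a for _ in range(votes))
            survivors.append(a if 2 * wins_a > votes else b)
        pool = survivors
    return pool[0]
```

With a noiseless comparator the best candidate always survives, since it wins every pair it appears in. With a noisy one, raising `votes` trades extra LLM calls for more reliable rounds.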

EMNLP 2025 Main (Oral)

GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?

Yiyang Zhou†, Linjie Li†, Shi Qiu†, Zhengyuan Yang, Yuyang Zhao, Siwei Han, Yangfan He, Kangqi Li, Haonian Ji, Zihao Zhao, Haibo Tong, Lijuan Wang, Huaxiu Yao

In this paper, we introduce GLIMPSE, a benchmark designed to evaluate whether large vision-language models (LVLMs) can truly think with videos rather than rely on static frames. Unlike prior video benchmarks that resemble image-based tasks, GLIMPSE emphasizes holistic temporal reasoning through 3,269 videos and 4,342 human-crafted questions across 11 categories. Each question requires full-context understanding of the entire video. While humans achieve 94.82% accuracy, the best LVLM, GPT-o3, reaches only 66.43%, revealing significant gaps in genuine video reasoning.

NeurIPS 2025 Spotlight

MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation

Haibo Tong†, Zhaoyang Wang†, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Zhongkai Xue, Yiyang Zhou, Peng Xia, Kexin Geng, Mingyu Ding, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

In this paper, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This benchmark incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained rewards. MJ-VIDEO dynamically selects relevant experts to judge preference accurately based on the input text-video pair. This architecture enables more precise and adaptable preference judgments.

Preprint

GRAPE: Generalizing Robot Policy via Preference Alignment

Zijian Zhang†, Kaiyuan Zheng†, Zhaorun Chen†, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, Huaxiu Yao

In this paper, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks into independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models.

📖 Education

University of North Carolina at Chapel Hill
Ph.D. student in Computer Science. 2025.08 - Present

Fudan University
Undergraduate student in Computer Science and Technology. 2021.09 - 2025.06

University of North Carolina at Chapel Hill
Exchange student. 2023.08 - 2023.12

💻 Internships

Advantest, China
R&D Associate Engineer. 2024.01 - 2024.05

University of North Carolina at Chapel Hill
Research Intern (remote). 2024.01 - 2025.07

🏆 Selected Honors & Awards

NeurIPS Spotlight Presentation (Top 5%), 2025
EMNLP Oral Presentation (Top 2%), 2025
KDD 2025 Health Day Distinguished Vision Award, 2025
ICLR Oral Presentation (Top 1.8%), 2025

💼 Academic Services

Workshop Co-Organizer: ICML 2025 Workshop on Reliable and Responsible Foundation Models

🗺️ Languages

Chinese: Native
English: TOEFL 110
Japanese: Elementary

🌟 Interests

🎨 Drawing illustration and manga (some of my paintings below)

🎹 Improvisational piano playing and piano recomposition
🎮 Games
- Baldur's Gate 3
- Divinity: Original Sin 2
- The Legend of Zelda: Breath of the Wild
- The Elder Scrolls V: Skyrim
- ...
🎶 Musicals
- The Phantom of the Opera
- Elisabeth
- Dracula
- ...
🐈 Kitty!