Hi! I’m Siwei Han (韩偲蔚), a senior at Fudan University majoring in Computer Science and Technology. I am interested in the alignment and application of LLMs, VLMs, and multimodal models, and I am currently interning in Prof. Huaxiu Yao’s team at UNC.


🔥 News

2025.03 We present MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding! The paper is available on arXiv.

2025.02 MMIE has been selected for an Oral presentation!

2025.01 MMIE is accepted by ICLR 2025!


📝 Publications

†: Equal contribution

Preprint

We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and images to solve DocQA problems. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document’s content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Our data and code are available on GitHub.

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao

ICLR 2025 Oral

In this paper, we introduce MMIE, a robust, knowledge-intensive benchmark to evaluate interleaved multimodal comprehension and generation in LVLMs. MMIE comprises 20K+ examples covering 12 fields and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric that leverages a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. We publicly release our benchmark and code on MMIE.

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Peng Xia†, Siwei Han†, Shi Qiu†, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao

ICML 2024

In this paper, we propose a novel comparison-based CoT generation algorithm that directly identifies the most promising thoughts with the noisy feedback from the LLM. In each round, we randomly pair intermediate thoughts and directly prompt the LLM to select the more promising one from each pair, allowing us to identify the most promising thoughts through an iterative process. To further model the noise in the comparison, we resort to the techniques of ensemble and dueling bandits and propose two variants of the proposed algorithm.
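
As a rough illustration of the pairing loop described above, here is a minimal sketch and not the paper's implementation. It assumes a hypothetical `compare(a, b)` callable that stands in for prompting the LLM and returns `True` when it judges `a` the more promising thought. The repeated majority vote is only a simple stand-in for the ensemble idea; the dueling-bandits variant is not shown.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_promising(
    thoughts: Sequence[T],
    compare: Callable[[T, T], bool],  # hypothetical stand-in for the LLM prompt
    keep: int = 1,
    votes: int = 3,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Randomly pair thoughts each round and keep every pair's winner.

    Returns at most `keep` surviving thoughts.
    """
    rng = rng or random.Random(0)
    pool = list(thoughts)
    while len(pool) > keep:
        rng.shuffle(pool)
        survivors = []
        # With an odd pool size, the unpaired thought advances automatically.
        if len(pool) % 2 == 1:
            survivors.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            # Repeat the (possibly noisy) comparison and take a majority vote.
            wins_a = sum(compare(a, b) for _ in range(votes))
            survivors.append(a if 2 * wins_a > votes else b)
        pool = survivors
    return pool


if __name__ == "__main__":
    # Toy comparator for illustration only: prefers the longer string.
    candidates = ["a", "bb", "ccc", "dddd", "eeeee"]
    print(select_promising(candidates, lambda a, b: len(a) > len(b)))
```

Each round roughly halves the pool, so the number of comparisons grows linearly with the number of candidate thoughts.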

Preprint

In this paper, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks into independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models.

GRAPE: Generalizing Robot Policy via Preference Alignment
Zijian Zhang†, Kaiyuan Zheng†, Zhaorun Chen†, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, Huaxiu Yao

Preprint

In this paper, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This benchmark incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained rewards. MJ-VIDEO dynamically selects the relevant experts to judge preference based on the input text-video pair, which enables more precise and adaptable preference judgments.

MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation
Haibo Tong†, Zhaoyang Wang†, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Zhongkai Xue, Yiyang Zhou, Peng Xia, Kexin Geng, Mingyu Ding, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

📖 Education

Fudan University
  Undergraduate student in Computer Science and Technology. 2021.09 - now GPA: 3.5/4

University of North Carolina at Chapel Hill
  Exchange student. 2023.08 - 2023.12 GPA: 3.9/4


💼 Internships

Advantest, China
  R&D Associate Engineer. 2024.01 - 2024.05

University of North Carolina at Chapel Hill
  Research Intern (remote). 2024.01 - now


💻 Projects

LLM Reviser Mar. 2024 - Jun. 2024

Leverages a reviser model to improve the quality of the original LLM's responses, then further preference-tunes the LLM using the LLM-generated responses as dispreferred answers and the revised answers as preferred answers.

THSH shell Oct. 2023

This shell can run standard Linux shell commands. It also supports creating new built-in commands. It includes features like pipelines, input/output redirection, script input, and debugging.

LLMEVAL - Fudan University’s NLP Laboratory Apr. 2023 - Jun. 2023

Designed questions and evaluation criteria for assessing large language models’ ability to solve questions in Chinese. Mainly responsible for the story generation and paragraph editing parts.


🗺️ Languages

  • Chinese: Native
  • English: TOEFL 110
  • Japanese: Elementary


🌟 Interests

  • 🎨 Drawing illustrations and manga (some of my paintings⬇)

  • 🎹 Improvisational piano playing and piano recomposition
  • 🎮 Games
    • Baldur’s Gate 3
    • Divinity: Original Sin 2
    • The Legend of Zelda: Breath of the Wild
    • The Elder Scrolls V: Skyrim
  • 🎶 Musicals
    • The Phantom of the Opera
    • Elisabeth
    • Dracula
  • 🐈 Kitty!



