Dissertation Title: Towards Efficient AI Alignment in Reinforcement Learning and Large Language Models
Speaker: Xuening Feng, Ph.D. candidate at UM-SJTU Joint Institute
Time: 6:00–8:00 p.m., November 6, 2025 (Beijing Time)
Location: Room 403, Longbin Building
Abstract
With the rapid advancement of artificial intelligence (AI), ensuring that AI systems accurately align with human intentions and values has become increasingly crucial. Misalignment can lead to unintended behaviors that are potentially harmful or ethically problematic, particularly in complex systems such as reinforcement learning (RL) agents and large language models (LLMs). The growing deployment of AI in critical sectors, including healthcare, finance, and autonomous driving, underscores the urgency of robust and practical alignment methodologies. However, existing alignment methods often face significant efficiency issues, primarily due to high computational costs, extensive human oversight requirements, and challenges in capturing nuanced human preferences. This dissertation addresses these efficiency challenges by proposing novel alignment methods specifically designed for RL and LLMs.
For RL, the dissertation advances efficient agent alignment with human intentions through reinforcement learning from human feedback (RLHF). Specifically, it introduces two novel approaches to improve sample efficiency and reduce human effort in RLHF. First, the Diverse, Uncertain, On-policy (DUO) method is proposed for sample-efficient RLHF. DUO strategically enhances query efficiency through principled query generation and selection criteria, ensuring queries are relevant to the current policy, epistemically uncertain, and diverse to avoid redundancy. Additionally, the dissertation proposes a second, complementary framework centered on distinguishability queries, a novel query modality explicitly designed to capture subtle differences in human preference strength. By enabling human evaluators to select between pairs of queries, this method simultaneously reduces cognitive load and increases the informativeness of the collected feedback.
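To make DUO's three selection criteria concrete, the sketch below is a minimal, hypothetical illustration rather than the dissertation's implementation: candidate query pairs are assumed to be drawn from recent rollouts so that on-policy relevance is handled at generation time, epistemic uncertainty is approximated by disagreement within a reward-model ensemble, and diversity is enforced greedily in an embedding space. All names (`reward_ensemble`, `embed`, the distance threshold) are illustrative assumptions.

```python
import numpy as np

def select_queries(candidates, reward_ensemble, embed, batch_size=10):
    """Hypothetical sketch of DUO-style query selection.

    candidates: (segment_a, segment_b) pairs sampled from recent
        rollouts, so relevance to the current policy is assumed to
        be handled at generation time.
    reward_ensemble: list of reward models r(segment) -> float.
    embed: maps a query pair to a feature vector used for diversity.
    """
    def uncertainty(pair):
        # Epistemic uncertainty: disagreement across the ensemble
        # about which segment of the pair is preferred.
        a, b = pair
        p = np.mean([float(r(a) > r(b)) for r in reward_ensemble])
        return p * (1.0 - p)  # maximal when the ensemble is split

    ranked = sorted(candidates, key=uncertainty, reverse=True)

    # Greedy diversity filter: keep a query only if it is far enough
    # (illustrative threshold) from those already selected.
    selected, feats = [], []
    for pair in ranked:
        f = embed(pair)
        if all(np.linalg.norm(f - g) > 1.0 for g in feats):
            selected.append(pair)
            feats.append(f)
        if len(selected) == batch_size:
            break
    return selected
```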
For LLMs, the dissertation presents VESPO, a search-based prompt optimization method enhanced by chain-of-thought reasoning for cost-efficient human value alignment. VESPO systematically identifies and mitigates risks associated with misalignment, particularly for sensitive topics and adversarial prompt inputs. By iteratively refining system prompts using human-like reasoning, VESPO ensures alignment with ethical standards and cultural sensitivities both proactively and reactively, avoiding the high computational cost of retraining LLMs and enhancing the adaptability and robustness of the value alignment process.
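The abstract does not specify VESPO's search procedure; the sketch below shows one plausible shape for a search-based system-prompt optimization loop with a chain-of-thought critique step. Every name here (`llm`, `critic_llm`, the scoring setup) is a hypothetical placeholder, not VESPO's actual interface.

```python
def optimize_system_prompt(initial_prompt, eval_cases, llm, critic_llm,
                           n_iters=5, n_variants=4):
    """Hypothetical sketch of search-based system-prompt optimization.

    eval_cases: list of (user_input, judge) pairs, where judge maps a
        response to True/False depending on whether it satisfies the
        value-alignment criteria (including adversarial inputs).
    llm(system_prompt, user_input) -> response string.
    critic_llm(prompt) -> a revised system prompt, produced after
        step-by-step (chain-of-thought) reasoning about failures.
    """
    def score(system_prompt):
        # Fraction of evaluation cases whose responses pass the check.
        results = [judge(llm(system_prompt, x)) for x, judge in eval_cases]
        return sum(results) / len(results)

    best_prompt, best_score = initial_prompt, score(initial_prompt)
    for _ in range(n_iters):
        # Ask the critic to reason about likely failure modes on
        # sensitive or adversarial inputs, then propose revisions.
        variants = [
            critic_llm(
                "Think step by step about why this system prompt may "
                "fail on sensitive or adversarial inputs, then output "
                f"an improved version:\n{best_prompt}"
            )
            for _ in range(n_variants)
        ]
        # Greedy search step: keep the best-scoring candidate.
        for cand in variants:
            s = score(cand)
            if s > best_score:
                best_prompt, best_score = cand, s
    return best_prompt
```

This loop captures the proactive side (hardening the prompt before deployment); a reactive variant would rerun the same search whenever new failure cases are collected.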
Extensive experimental evaluations demonstrate that these proposed approaches significantly outperform existing state-of-the-art (SOTA) methods in efficiency and effectiveness. Specifically, DUO achieves superior performance across a variety of locomotion and robotic manipulation tasks, markedly reducing the required volume of human feedback while maintaining or improving alignment accuracy. The framework based on distinguishability queries further enhances learning efficiency and evaluator comfort, significantly improving alignment outcomes in user studies and synthetic-oracle experiments. Similarly, VESPO proves effective across multilingual and multicultural datasets, robustly managing ethical dilemmas and adversarial attacks and substantially improving alignment metrics. Collectively, these methods represent a substantial advancement towards practical AI alignment, offering robust, efficient, and scalable solutions to alignment challenges across diverse real-world applications.
Biography
Xuening Feng is a Ph.D. candidate at the Shanghai Jiao Tong University Global College, supervised by Professor Yifei Zhu and Professor Paul Weng. Her research focuses on deep reinforcement learning, reinforcement learning from human preferences, and large language model alignment. Her dissertation, “Towards Efficient AI Alignment in Reinforcement Learning and Large Language Models,” investigates efficient methods for aligning intelligent systems with human intentions and values.