Ziniu Li

The Chinese University of Hong Kong, Shenzhen

[email protected]

<aside> ✨

TL;DR

Reinforcement Learning (RL) plays a crucial role in enhancing Large Language Models (LLMs) for complex tasks like mathematical reasoning. However, when cold-starting LLMs for RL training, a key challenge is maintaining output diversity. Traditional supervised fine-tuning (SFT) for cold-start can lead to overfitting, reducing output diversity and restricting the performance improvements achievable in later RL stages. This blog discusses a new approach using GEM, a diversity-preserving fine-tuning algorithm, which helps optimize the RL training process.

👨‍💻 Code, 📈 Wandb Logs

</aside>

Introduction

Background

Large Language Models (LLMs) have demonstrated remarkable performance in tackling challenging tasks such as translation, summarization, and reasoning. These models undergo extensive pre-training to acquire broad knowledge and are further fine-tuned through post-training to develop specialized capabilities for downstream tasks. Recently, there has been growing interest in adapting LLMs to highly challenging tasks, including competition-level mathematics and code generation, using Reinforcement Learning (RL). Notably, RL-tuned models have shown significant performance improvements that are difficult to achieve with other techniques. Examples of recent advances in this area include OpenAI-o1, DeepSeek-R1, and Kimi-K1.5.

We are particularly interested in RL techniques and believe there is room to refine them further to fully unlock the power of RL. Achieving strong performance with RL algorithms depends on several key factors. At a high level, we can break down the RL process into two main steps: exploration, in which the model samples candidate responses, and exploitation, in which the policy is updated using the rewards of the collected responses.

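To make this two-step view concrete, here is a minimal sketch of one training iteration with a toy policy and a verifier-style reward. Every name in it is an illustrative placeholder, not part of an actual training pipeline.

```python
import random

def sample_responses(policy, prompt, num_samples=4):
    # Step 1 (exploration): draw several candidate responses from the policy.
    # A real policy would sample from an LLM; here we fake it with random picks.
    return [policy(prompt) for _ in range(num_samples)]

def rl_training_step(policy, prompts, reward_fn):
    # Step 2 (exploitation): score the sampled candidates and use them to
    # update the policy (the actual update, e.g., a policy-gradient step,
    # is omitted in this sketch).
    batch = []
    for prompt in prompts:
        candidates = sample_responses(policy, prompt)
        rewards = [reward_fn(prompt, c) for c in candidates]
        batch.append((prompt, candidates, rewards))
    return batch

# Toy usage with a random "policy" and an exact-match reward.
toy_policy = lambda p: random.choice(["408", "400", "407"])
reward = lambda p, r: float(r == "408")
print(rl_training_step(toy_policy, ["What is 17 * 24?"], reward))
```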

Significant efforts have been devoted to the exploitation stage in recent years. For instance, ReMax was the first to identify key properties of LLMs and design a customized gradient estimator around them, significantly improving the stability and efficiency of LLM training. Since ReMax, many RL algorithms, such as RLOO and GRPO, have been developed to further advance this area.
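
As a rough illustration of that idea, a ReMax-style update can be read as REINFORCE with the reward of the greedy-decoded response used as a baseline. The snippet below is an illustrative simplification of this, not the reference implementation.

```python
import torch

def remax_style_loss(logprob_sampled, reward_sampled, reward_greedy):
    # REINFORCE with a baseline: the advantage is the sampled response's reward
    # minus the reward of the greedy-decoded response for the same prompt.
    advantage = reward_sampled - reward_greedy
    # Negative sign because optimizers minimize; `logprob_sampled` is the sum of
    # token log-probabilities of the sampled response under the current policy.
    return -(advantage * logprob_sampled).mean()

# Toy usage: two prompts, each with one sampled response and one greedy response.
logp = torch.tensor([-12.3, -9.8], requires_grad=True)  # log pi(y_sample | x)
r_sample = torch.tensor([1.0, 0.0])                      # verifier rewards
r_greedy = torch.tensor([0.0, 1.0])                      # baseline rewards
remax_style_loss(logp, r_sample, r_greedy).backward()
print(logp.grad)
```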

Unfortunately, the exploration stage has received considerably less attention. While the "classic" RL literature has extensively investigated exploration techniques (see, e.g., the survey), these methods were primarily designed for general RL tasks. As a result, they may be overly complex and introduce unnecessary computational overhead, making them less suitable for large-scale LLM training; see the discussion in the work of ReMax. Consequently, these traditional exploration techniques have not been widely adopted in LLM training. Currently, exploration in LLMs largely relies on random sampling to generate multiple candidate solutions, which is a relatively simplistic approach. In this setting, output diversity becomes a crucial factor. The importance of output diversity has been highlighted in many LLM works, for example, self-consistency, test-time compute, and planning in LLM search.
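
As a concrete illustration of this sampling-based exploration, the snippet below draws several candidates for one prompt with Hugging Face transformers and counts how many distinct completions appear. The model name and the crude diversity measure are placeholders chosen for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: What is 17 * 24? Answer with a number.\nA:"
inputs = tok(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,          # random sampling: the de facto exploration mechanism
    temperature=1.0,
    max_new_tokens=32,
    num_return_sequences=8,  # multiple candidate solutions per prompt
)
# Strip the prompt tokens and keep only the generated continuations.
texts = tok.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# A crude diversity measure: the number of distinct completions among the samples.
print(f"{len(set(t.strip() for t in texts))} distinct completions out of {len(texts)}")
```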


Our Insight

In this blog, we argue that exploration in LLMs differs significantly from exploration in classical RL tasks. The key distinction lies in the fact that LLMs are extensively pre-trained, meaning that "exploration" in this context involves sampling responses based on the knowledge already encoded within the model. Therefore, exploration here can be viewed as navigating the vast space of these pre-existing responses to generate new outputs or uncover diverse perspectives.


In contrast, in classical RL tasks, agents start with no prior knowledge of the task and must explore in a completely uncertain environment. This makes exploration in classical RL more challenging, as it requires learning from scratch and updating knowledge solely based on collected data. As a result, extensive research has been dedicated to estimating epistemic uncertainty in environments, particularly within sequential decision-making settings.
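
For contrast, a representative classical technique is a count-based exploration bonus that rewards rarely visited states. The sketch below is a generic, textbook-style illustration of that idea, not something used in this blog's LLM setting.

```python
import math
from collections import defaultdict

# Count-based exploration bonus: states visited less often receive a larger
# intrinsic reward, nudging the agent toward uncertain parts of the environment.
visit_counts = defaultdict(int)

def reward_with_bonus(state, extrinsic_reward, coef=0.1):
    visit_counts[state] += 1
    bonus = coef / math.sqrt(visit_counts[state])
    return extrinsic_reward + bonus

# Toy usage: revisiting the same state yields a shrinking bonus.
for _ in range(3):
    print(reward_with_bonus("s0", 0.0))
```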
