How Direct Nash Optimization Improves AI Model Training

Written by languagemodels | Published 2025/04/15
Tech Story Tags: llm-fine-tuning | direct-nash-optimization | contrastive-learning-ai | ai-feedback-loops | ai-preference-optimization | how-to-train-ai | rlhf-optimization | dno-algorithm

TLDR: This section introduces Direct Nash Optimization (DNO), a new algorithm for approximating Nash equilibria in RLHF. Unlike previous methods such as SPO or Nash-MD, DNO avoids unstable two-timescale updates and complex hyperparameter tuning by using batched on-policy learning and a regression-based objective. It improves stability and sample efficiency while providing new theoretical insights into on-policy sampling in AI training.

Authors:

(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];

(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].

Table of Links

Abstract and 1 Introduction

2 Preliminaries

2.1 RLHF Based on Reward Models

2.2 RLHF with General Preferences

3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1

3.2 Theoretical Analysis

4 Practical Algorithm – Iterative Contrastive Self-Improvement

5 Experiments and 5.1 Experimental Setup

5.2 Results and Analysis

6 Related Work

7 Conclusion and References

Appendix

A Extension to Regularized Preferences

B Detailed Proofs

C Additional Experimental Details

3 Direct Nash Optimization

While the no-regret update of soft policy iteration used in SPO and Nash-MD has inspired many standard (deep) reinforcement learning algorithms (e.g., NPG, Kakade, 2001; TRPO, Schulman et al., 2015; PPO, Schulman et al., 2017; SAC, Haarnoja et al., 2018), a faithful implementation still usually involves two-timescale updates, which can lead to complex hyperparameter tuning and unstable performance. In this section, we propose a direct and iterative algorithm, Direct Nash Optimization (Algorithm 1), to approximate the Nash equilibrium of MW(P). The algorithm is primarily inspired by SPO. It can be readily adapted to Nash-MD for approximating the Nash equilibrium of MW(Pτ) with a last-iterate guarantee, as we discuss in Appendix A.

3.1 Derivation of Algorithm 1

Most practical algorithms inspired by soft policy iteration, including the original practical version of SPO, adopt the following approach: "pushing" π towards the learning goal below at each iteration (we will refer to this as the soft policy iteration target throughout the paper):

πt+1(y | x) ∝ πt(y | x) exp(η rt(x, y)) for a step size η > 0; equivalently, πt+1(y | x) = πt(y | x) exp(η rt(x, y)) / Zt(x), where Zt(x) = Σy′ πt(y′ | x) exp(η rt(x, y′)) is the partition function.
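
As a concrete, purely illustrative example, the sketch below computes this target for a single prompt over a small discrete set of candidate responses, assuming the current-policy probabilities πt(y | x) and rewards rt(x, y) are already available; the function name and the toy numbers are hypothetical, not taken from the paper.

```python
import numpy as np

def soft_policy_iteration_target(pi_t, r_t, eta):
    """Soft policy iteration target for one prompt x over a discrete candidate set.

    pi_t : current-policy probabilities pi_t(y | x) for each candidate response
    r_t  : rewards r_t(x, y) for the same candidates
    eta  : step size of the update
    """
    unnormalized = pi_t * np.exp(eta * r_t)   # pi_t(y | x) * exp(eta * r_t(x, y))
    Z_t = unnormalized.sum()                  # partition function Z_t(x)
    return unnormalized / Z_t                 # (for large rewards, work in log space instead)

# Toy example: three candidate responses for a single prompt.
pi_t = np.array([0.5, 0.3, 0.2])   # current policy pi_t(y | x)
r_t  = np.array([0.1, 0.8, 0.3])   # internal reward r_t(x, y)
print(soft_policy_iteration_target(pi_t, r_t, eta=1.0))
```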

However, implementing the above approach typically necessitates on-policy sampling from the current policy π. Ignoring the Zt(x) term can also lead to high variance in the empirical gradient estimate. This is a persistent issue in actor-critic-style algorithms and usually calls for an additional baseline (for details see, e.g., Mnih et al., 2016), which itself requires on-policy estimation. When rt also varies over iterations, as in SPO or Nash-MD, the policy, baseline, and reward must all be updated online simultaneously. These challenges have hindered the scalability of existing algorithms based on learning the Nash equilibrium of general preference functions.
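
To make the baseline point concrete, here is a small, self-contained illustration (not taken from the paper or the cited works) of a score-function policy-gradient estimate for a toy categorical policy, with and without a baseline: the baseline leaves the estimate unbiased but shrinks its variance, and computing it requires expectations under the current policy, i.e., on-policy estimation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": softmax over 3 actions with logits theta, and fixed per-action rewards.
theta = np.array([0.2, -0.1, 0.4])
rewards = np.array([1.0, 5.0, 2.0])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_estimates(n_samples, use_baseline):
    pi = softmax(theta)
    # On-policy value used as baseline (exact here because the toy is tabular;
    # in practice it would itself be estimated from on-policy samples).
    baseline = (pi * rewards).sum() if use_baseline else 0.0
    grads = []
    for _ in range(n_samples):
        a = rng.choice(3, p=pi)
        score = -pi.copy()          # d/dtheta log pi(a) = onehot(a) - pi
        score[a] += 1.0
        grads.append((rewards[a] - baseline) * score)
    return np.array(grads)

for flag in (False, True):
    g = grad_estimates(20000, use_baseline=flag)
    print("baseline" if flag else "no baseline",
          "| mean:", g.mean(axis=0).round(3),
          "| per-coordinate variance:", g.var(axis=0).round(3))
```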

Monotonic improvement from batched on-policy updates. One key distinction between DNO and existing algorithms for learning the Nash equilibrium (such as SPO and Nash-MD) is that those algorithms aim to approach the Nash equilibrium in a purely on-policy manner, which can be unstable and may need to incorporate two-timescale updates (which change the reward function used in the inner problem more frequently). DNO, on the other hand, is a batched on-policy algorithm with single-timescale updates.
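
For intuition only, the tabular toy below iterates a single-timescale update of this kind exactly, for one prompt with four candidate responses and a known pairwise preference matrix. It replaces DNO's batched on-policy sampling and regression fit with closed-form computations, and the preference values are made up for illustration.

```python
import numpy as np

# Illustrative tabular toy (not the paper's algorithm): one prompt, four candidate
# responses, and a general preference matrix P[i, j] = P(y_i is preferred over y_j).
P = np.array([
    [0.5, 0.6, 0.7, 0.4],
    [0.4, 0.5, 0.6, 0.3],
    [0.3, 0.4, 0.5, 0.2],
    [0.6, 0.7, 0.8, 0.5],
])

pi_t = np.full(4, 0.25)   # initial policy pi_0(y | x)
eta = 1.0                 # step size of soft policy iteration

for t in range(50):
    # The "reward" of each response is its expected win rate against the current
    # policy, recomputed once per iteration (single timescale).
    r_t = P @ pi_t
    # Exact soft policy iteration step; DNO approximates this with a regression
    # fit on a batch of on-policy samples rather than computing it in closed form.
    logits = np.log(pi_t) + eta * r_t
    pi_t = np.exp(logits - logits.max())
    pi_t /= pi_t.sum()

print(pi_t.round(3))   # concentrates on the response with the highest overall win rate
```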

3.2 Theoretical Analysis

One of our major proposals is to use a regression-based objective to approximate the explicit soft policy iteration; in this section we show that the approximation error of this regression is tightly bounded via a finite-sample analysis. Theorem 1 characterizes how well the solution of the regression-based objective (defined in Eq. (12), or Line 4 of Algorithm 1) approximates the soft policy iteration update (Eq. (9)) in terms of the total variation metric at each iteration.
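
Concretely, a regression objective of this kind can be written as a squared-error fit of the change in policy log-ratios to the scaled reward gap; because y1 and y2 share the same prompt, the partition function Zt(x) cancels from the residual. The sketch below assumes that form (the exact scaling and sampling distributions of Eq. (12) are specified in the paper), and all numeric values are illustrative.

```python
import numpy as np

def dno_regression_loss(logp_new_y1, logp_new_y2,
                        logp_old_y1, logp_old_y2,
                        r_y1, r_y2, eta):
    """Squared-error loss whose minimizer matches the soft policy iteration target.

    The residual compares the difference of log-probability ratios (new policy vs.
    current policy pi_t) with the scaled reward gap; Z_t(x) cancels because y1 and
    y2 are responses to the same prompt x.
    """
    log_ratio_diff = (logp_new_y1 - logp_old_y1) - (logp_new_y2 - logp_old_y2)
    target = eta * (r_y1 - r_y2)
    return np.mean((log_ratio_diff - target) ** 2)

# Toy batch of three (y1, y2) pairs for illustration.
loss = dno_regression_loss(
    logp_new_y1=np.array([-1.2, -0.9, -2.0]),
    logp_new_y2=np.array([-1.5, -1.8, -1.1]),
    logp_old_y1=np.array([-1.4, -1.0, -2.2]),
    logp_old_y2=np.array([-1.3, -1.7, -1.0]),
    r_y1=np.array([0.8, 0.6, 0.9]),
    r_y2=np.array([0.2, 0.5, 0.4]),
    eta=1.0,
)
print(loss)
```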

Note that we present the concentrability coefficient Ct as data-dependent, with πt+1 (learned from data) appearing in its definition. The intent is for such a Ct to guide the design choices of µ1,t and µ2,t in the interest of sample efficiency. The formal statement and detailed proof of Theorem 1, without involving πt+1, are deferred to Appendix B. Although Ct shares a similar expression with the concentrability coefficients used in offline reinforcement learning (e.g., Chen and Jiang, 2019; Xie et al., 2021), the policies µ1,t and µ2,t are flexible here thanks to the generative nature of large language models. This flexibility allows for additional intervention, enhancing sample efficiency.

Another interesting observation is that, although Eq. (12) shares a similar form with Bradley-Terry-style reward modeling via MLE, the target distributions used to measure distribution shift are quite different. This disparity is due to the different objectives: fitting soft policy iteration versus estimating a reward. For Bradley-Terry-style reward modeling via MLE, the desired distributions of y1 and y2 should be two distinct distributions (see, e.g., Zhan et al., 2024; Xiong et al., 2023). In our case, however, where the learning goal is to fit the soft policy iteration target, we may prefer y1 and y2 drawn from two (near) on-policy distributions, as discussed above, as long as we expect the learned πt+1 to be accurate enough. To the best of our knowledge, this is the first theoretical result illustrating the importance of on-policy sampling beyond policy-optimization-style algorithms for RLHF.
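
For contrast, a Bradley-Terry-style reward model is typically fit with a logistic (MLE) loss on pairwise comparisons, whereas the sketch above fits a squared-error regression toward the soft policy iteration target. The snippet below shows the MLE loss with hypothetical scores, to highlight that the two objectives place different demands on where y1 and y2 should be sampled from.

```python
import numpy as np

def bradley_terry_nll(score_chosen, score_rejected):
    """Bradley-Terry MLE loss for a reward model: -log sigmoid(s_chosen - s_rejected)."""
    margin = score_chosen - score_rejected
    return np.mean(np.log1p(np.exp(-margin)))   # stable for the positive margins used here

# Hypothetical reward-model scores for three (chosen, rejected) pairs. For reward
# estimation, the pairs should ideally come from two distinct distributions; for
# fitting soft policy iteration, (near) on-policy pairs are preferred instead.
print(bradley_terry_nll(np.array([1.2, 0.4, 2.0]), np.array([0.3, 0.9, 1.1])))
```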

This paper is available on arxiv under CC BY 4.0 DEED license.

