https://www.youtube.com/watch?v=vuWbJlBePPA&list=PLgKuh-lKre12qVTl88k2n2N37tT-BpmHT&index=3

Background

  1. Fine-Tune GPT-3: Use supervised learning to train GPT-3 on specific tasks by providing it with correct input-output pairs.
  2. Sample & Rank Outputs: Generate multiple outputs for a new prompt with the fine-tuned model, and have a human labeler rank them from best to worst.
  3. Train Reward Model: Use these rankings to train a reward model that scores how good an output is (see the sketch after this list).
  4. Generate Response Based on Policy: Use the current policy (the LM being tuned) to generate a response to a given prompt.
  5. Evaluate & Update: Score the response with the reward model, then update the policy based on that reward. Repeating this loop refines the policy over time, helping the model generate better responses.
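As a rough illustration of step 3, here is a minimal PyTorch sketch of the pairwise ranking loss a reward model could be trained with; the `reward_head` and the random feature tensors are hypothetical stand-ins, not the actual InstructGPT setup.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in: a reward head mapping a pooled hidden state
# (here just a random feature vector) to a single scalar score.
reward_head = torch.nn.Linear(16, 1)

# Dummy features for a batch of 4 preference pairs: for each prompt,
# one output the labeler ranked higher (chosen) and one ranked lower (rejected).
chosen_features = torch.randn(4, 16)
rejected_features = torch.randn(4, 16)

chosen_scores = reward_head(chosen_features).squeeze(-1)      # shape: (4,)
rejected_scores = reward_head(rejected_features).squeeze(-1)  # shape: (4,)

# Pairwise (Bradley-Terry style) ranking loss: push the chosen score
# above the rejected score via -log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
loss.backward()
print(loss.item())
```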

Direct Preference Optimization (DPO)

Key idea: there is a direct correspondence between the reward model and the optimal policy, so the reward can be rewritten in terms of the policy itself.
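This correspondence can be written out. Assuming the standard KL-regularized objective against a frozen reference model pi_ref with strength beta (notation not spelled out in the notes above, but standard for DPO), the optimal policy has a closed form that can be inverted to express the reward through the policy:

```latex
% KL-regularized objective that the RLHF step optimizes
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right]

% Its optimal policy in closed form, with partition function Z(x)
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\left( \tfrac{1}{\beta}\, r(x, y) \right)

% Inverting this expresses the reward through the policy
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
```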

Select preference data (pairs of responses to the same prompt, one preferred over the other) and measure the log-likelihood of each response under the LM.
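A small sketch of what "measure the log-likelihood" amounts to at the token level, assuming the LM returns next-token logits; the tensors below are dummy stand-ins for real model outputs.

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `labels` under `logits`.

    logits: (batch, seq_len, vocab_size) next-token logits from the LM
    labels: (batch, seq_len) token ids of the response being scored
    In practice the logits/labels must be shifted by one position and the
    prompt tokens masked out; that bookkeeping is omitted here.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1)  # one log-likelihood per sequence

# Dummy stand-ins for real model outputs: batch of 2, 5 tokens, vocab of 50.
logits = torch.randn(2, 5, 50)
labels = torch.randint(0, 50, (2, 5))
print(sequence_log_likelihood(logits, labels))  # shape: (2,)
```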

The final LM, trained this way, is itself the optimal policy with respect to the implicit reward function, so no separate reward model has to be fit.
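Plugging that reparameterized reward into a Bradley-Terry preference model gives the DPO loss over preference pairs (y_w preferred, y_l dispreferred); the partition-function terms cancel because both responses share the same prompt:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```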

The loss is now a function of the policy directly: the gradient is the change in loss with respect to the policy's weights, so the LM can be updated by ordinary gradient descent rather than by an RL loop.
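Putting this together, a minimal sketch of that loss and its gradient in PyTorch, assuming per-sequence log-likelihoods from the trainable policy and a frozen reference model (as computed by the helper above); the numbers and the beta value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """-log sigmoid(beta * [(log pi/pi_ref)(chosen) - (log pi/pi_ref)(rejected)])."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy per-sequence log-likelihoods for a batch of 3 preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8, -15.1], requires_grad=True)
policy_rejected = torch.tensor([-11.0, -10.5, -14.0], requires_grad=True)
ref_chosen = torch.tensor([-12.0, -10.0, -15.0])
ref_rejected = torch.tensor([-11.5, -10.2, -14.3])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradient flows only through the policy log-probs
print(loss.item())
```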