Is Nvidia's SteerLM the RLHF killer?
Reinforcement learning from human feedback (RLHF) is one of the key reasons behind the success of LLMs such as ChatGPT. RLHF enables LLMs to follow user instructions and align their answers with user goals.
However, RLHF has some fundamental challenges that reduce its efficiency and make it inaccessible to organizations with limited resources.
SteerLM, a new technique developed by researchers at Nvidia, promises to solve the challenges of RLHF and provide better results at lower complexity.
Nvidia has released the code, data, and models, making it possible for other researchers to further improve SteerLM.
Key ideas:
The traditional way to align LLMs is to perform supervised fine-tuning (SFT) followed by RLHF
SFT and RLHF have known limitations, including optimization against a single reward value and a complex, resource-intensive training setup
SteerLM is built on the idea of conditioning LLMs on multi-dimensional goals, including quality and attributes such as helpfulness, toxicity, and humor (see the sketch after this list)
SteerLM also bootstraps itself by annotating its own training examples as opposed to requiring a large corpus of human-annotated examples
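To illustrate the conditioning idea, here is a minimal Python sketch of what an attribute-conditioned prompt could look like. The attribute names, the 0-4 score range, and the tag format are assumptions for illustration, not the exact format Nvidia uses.

```python
# Minimal sketch of multi-attribute conditioning; the <attributes> tag format
# and the 0-4 score range are assumptions, not Nvidia's exact specification.

def build_steered_prompt(user_prompt: str, attributes: dict) -> str:
    """Prepend the desired attribute values so the LLM can condition on them."""
    attr_str = ",".join(f"{name}:{value}" for name, value in attributes.items())
    return f"<attributes>{attr_str}</attributes>\nUser: {user_prompt}\nAssistant:"

prompt = build_steered_prompt(
    "Explain transformers to a five-year-old.",
    {"quality": 4, "helpfulness": 4, "toxicity": 0, "humor": 2},
)
# The fine-tuned LLM learns during SFT to match its response to these values,
# so changing, e.g., humor from 2 to 4 steers the style of the answer.
print(prompt)
```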
SteerLM works in multiple steps:
First, an attribute prediction model (APM) learns to predict the attribute values of model responses (see the first sketch after this list)
Second, the APM is used to annotate training examples for the main LLM
Third, the main LLM undergoes SFT on the annotated examples
Fourth, the model samples new responses from prompts, the APM re-annotates them with attribute values, and the SFT step is repeated (see the second sketch below)
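To make the first two steps concrete, here is a minimal Python sketch, assuming a regression-style attribute model. The class names, attribute set, and data shapes are illustrative assumptions, not Nvidia's released code.

```python
# Minimal sketch of steps 1-2: train an attribute prediction model (APM) on
# human-rated examples, then use it to annotate a larger unlabeled corpus.
from dataclasses import dataclass
from typing import Optional

ATTRIBUTES = ["quality", "helpfulness", "toxicity", "humor"]

@dataclass
class Example:
    prompt: str
    response: str
    scores: Optional[dict] = None  # filled in by the APM in step 2

class AttributePredictionModel:
    """Stand-in for an LLM fine-tuned to score responses along each attribute."""

    def fit(self, rated_examples: list[Example]) -> None:
        # Step 1: train on a (relatively small) set of human-rated examples.
        ...

    def predict(self, prompt: str, response: str) -> dict:
        # Returns one score per attribute dimension; placeholder values here.
        return {name: 0.0 for name in ATTRIBUTES}

def annotate(apm: AttributePredictionModel, corpus: list[Example]) -> list[Example]:
    """Step 2: label every (prompt, response) pair with predicted attribute scores."""
    for ex in corpus:
        ex.scores = apm.predict(ex.prompt, ex.response)
    return corpus
```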
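And a sketch of steps three and four, continuing the stand-ins above. `sample_response` and `sft_train` are hypothetical placeholders for whatever decoding and fine-tuning machinery is actually used.

```python
# Sketch of steps 3-4, reusing Example and AttributePredictionModel from above.
# `sample_response` and `sft_train` are hypothetical placeholders.

def to_conditioned_text(ex: Example) -> str:
    """Step 3: fold the attribute scores into the prompt for conditioned SFT."""
    attrs = ",".join(f"{k}:{round(v)}" for k, v in ex.scores.items())
    return f"<attributes>{attrs}</attributes>\nUser: {ex.prompt}\nAssistant: {ex.response}"

def sample_response(model, prompt: str, target: dict) -> str:
    """Placeholder: decode a response conditioned on the desired attribute values."""
    ...

def sft_train(model, texts: list[str]):
    """Placeholder: run supervised fine-tuning on the conditioned training texts."""
    ...

def bootstrap_round(model, apm: AttributePredictionModel,
                    prompts: list[str], target: dict):
    """Step 4: sample new responses, re-annotate them with the APM, and repeat SFT."""
    new_examples = []
    for p in prompts:
        response = sample_response(model, p, target)
        # Score the model's own output so its labels stay consistent with the APM.
        new_examples.append(Example(p, response, apm.predict(p, response)))
    return sft_train(model, [to_conditioned_text(ex) for ex in new_examples])
```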
Read the full article on TechTalks.