PicoBlog

Is Nvidia's SteerLM the RLHF killer?

Reinforcement learning from human feedback (RLHF) is one of the key reasons behind the success of LLMs such as ChatGPT. RLHF enables LLMs to follow user instructions and align their answers with user goals.

However, RLHF has some fundamental challenges that reduce its efficiency and make it inaccessible to organizations with limited resources.

SteerLM, a new technique developed by researchers at Nvidia, promises to solve the challenges of RLHF and deliver better results with less complexity.

Nvidia has released the code, data, and models, making it possible for other researchers to further improve SteerLM.

Key ideas:

  • The traditional way to align LLMs is to perform supervised fine-tuning (SFT) followed by RLHF

  • SFT and RLHF have known limitations, including optimizing responses for a single reward value and requiring a complex training setup

  • SteerLM builds on the idea of conditioning LLMs on multi-dimensional goals, including overall quality and attributes such as helpfulness, toxicity, and humor

  • SteerLM also bootstraps itself by annotating its own training examples as opposed to requiring a large corpus of human-annotated examples

  • SteerLM works in multiple steps (a simplified code sketch follows this list):

    • First, an attribute prediction model (APM) learns to predict the attributes of model responses

    • Second, the APM is used to annotate training examples for the main LLM

    • Third, the main LLM undergoes SFT with the annotated examples

    • Fourth, the model samples new responses from prompts, annotates them with the APM, and repeats the SFT step on the new examples
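
To make the attribute-conditioning idea concrete, here is a minimal Python sketch of how an APM-annotated example could be formatted for SFT and how a user might steer the model at inference time. The prompt template, attribute names, and 0-9 scoring scale are illustrative assumptions, not Nvidia's released format, and the APM is replaced by a stub.

```python
# Illustrative sketch of SteerLM-style attribute conditioning.
# The template, attribute names, and scores below are assumptions for clarity.

from dataclasses import dataclass
from typing import Dict


@dataclass
class AnnotatedExample:
    prompt: str
    response: str
    attributes: Dict[str, int]  # e.g. {"quality": 9, "toxicity": 0}


def score_with_apm(prompt: str, response: str) -> Dict[str, int]:
    """Step 2: the attribute prediction model (APM) annotates a response.

    Stand-in for a trained APM; returns fixed scores for illustration only.
    """
    return {"quality": 9, "helpfulness": 8, "toxicity": 0, "humor": 1}


def format_for_sft(example: AnnotatedExample) -> str:
    """Step 3: build an attribute-conditioned training string for SFT.

    The attribute values are placed in the input so the LLM learns to produce
    responses that match the requested attribute levels.
    """
    attr_str = ",".join(f"{k}:{v}" for k, v in sorted(example.attributes.items()))
    return (
        f"<prompt>{example.prompt}</prompt>\n"
        f"<attributes>{attr_str}</attributes>\n"
        f"<response>{example.response}</response>"
    )


def steer_at_inference(prompt: str, desired: Dict[str, int]) -> str:
    """At inference time, steer the model by requesting target attribute values."""
    attr_str = ",".join(f"{k}:{v}" for k, v in sorted(desired.items()))
    return f"<prompt>{prompt}</prompt>\n<attributes>{attr_str}</attributes>\n<response>"


if __name__ == "__main__":
    prompt = "Explain RLHF in one sentence."
    response = "RLHF fine-tunes a model with a reward signal learned from human preferences."

    # Steps 2-3: annotate with the APM, then format for supervised fine-tuning.
    example = AnnotatedExample(prompt, response, score_with_apm(prompt, response))
    print(format_for_sft(example))

    # Inference: ask for a maximally helpful, non-toxic answer.
    print(steer_at_inference(prompt, {"quality": 9, "helpfulness": 9, "toxicity": 0}))
```

In step 4, the same formatting would be reused: the model's newly sampled responses are scored by the APM, turned into conditioned examples like the ones above, and fed back into another round of SFT.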

Read the full article on TechTalks.


Delta Gatti

Update: 2024-12-02