Generative AI Templates for Reinforcement Learning with Verifiable Rewards
Large language models (LLMs) are powerful, but their outputs aren't always accurate, helpful, or safe in real-world use. To get them to perform better, we fine-tune them with human feedback, which aligns them with our preferences and mitigates bias.
A common technique for fine-tuning LLMs is Reinforcement Learning with Verifiable Rewards (RLVR). RLVR fine-tunes a model against a clear, checkable reward signal, but the process still depends on a large dataset of human-preferred responses to learn from.
This is where Label Studio becomes valuable. It offers ready-made templates that make it easier to gather preference data. Each template supports a different annotation style and outputs data in a structured format. For example, you can:
- Categorize responses from multiple LLMs into “relevant” and “biased.”
- Compare two answers side by side and pick the better one.
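For instance, a single comparison collected with the side-by-side template boils down to a small structured record. The sketch below is purely illustrative; the field names (`response_a`, `preferred`, `annotator`) are placeholders, not Label Studio's actual export schema.
```
# Illustrative only: roughly what one collected preference comparison looks like
# once it's been flattened into a structured record (field names are placeholders).
preference_record = {
    "prompt": "Explain photosynthesis in one sentence.",
    "response_a": "Plants use sunlight, water, and CO2 to make glucose and oxygen.",
    "response_b": "It's a plant thing.",
    "preferred": "response_a",   # the annotator's side-by-side choice
    "annotator": "reviewer_01",  # hypothetical metadata
}
```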
In this article, we'll look at the role of preference data in RLVR, explore Label Studio's generative AI templates, and create our own custom RLVR template for fine-tuning a math tutor bot.
The Role of Human Preference Data in RLVR
Human feedback is essential for RLVR training because it is used to build a reward model that acts as a scoring system or guide. The reward model is trained on human-annotated data to learn what counts as a "good" or "bad" response. Its purpose is to automate feedback by assigning a numerical score, or "reward," to any text the language model generates.
However, directly defining a reward function that captures human nuance is nearly impossible: human preferences are complex and subjective, which makes them extremely difficult to encode in a static reward function.
Templates provide an effective way to collect annotated data that defines what “good” and “bad” responses look like. These judgments act as direct signals that help train the system to give better answers over time. Templates also make it easier to gather feedback consistently across projects without the need to build custom tools.
After you've gathered feedback, it's used to shape the reward model, which in turn guides the LLM's learning. This iterative feedback loop is the foundation of the RLVR process. The reward model provides a learning signal, which is used by a reinforcement learning algorithm like Proximal Policy Optimization (PPO) to shape the LLM’s behavior. This process allows feedback to be refined and reused across different projects.
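To make this loop concrete, here is a deliberately tiny, self-contained sketch. It is not a real reward model: it uses hand-crafted features and a few lines of logistic regression as a stand-in, just to show how 1/0 human labels turn into a scoring function that an algorithm like PPO could then optimize against.
```
import numpy as np

def features(response: str) -> np.ndarray:
    # Two crude features plus a bias term: response length and whether it explains its steps.
    explains = any(kw in response.lower() for kw in ("because", "then", "step", "so "))
    return np.array([len(response.split()) / 50.0, float(explains), 1.0])

# Human-labeled responses: 1 = good (explains the steps), 0 = bad (terse or unhelpful).
labeled = [
    ("Subtract 3 from both sides, then divide by 2, so x = 2.", 1),
    ("x = 2", 0),
    ("Use the power rule: the exponent comes down, so d/dx x^2 = 2x.", 1),
    ("2x", 0),
]

X = np.stack([features(r) for r, _ in labeled])
y = np.array([label for _, label in labeled], dtype=float)

# Fit a tiny logistic regression "reward model" with plain gradient descent.
w = np.zeros(X.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / len(y)

def reward(response: str) -> float:
    """Score an unseen response; a PPO-style update would optimize against this number."""
    return float(1.0 / (1.0 + np.exp(-features(response) @ w)))

print(reward("Divide both sides by 4, then simplify, so x = 3."))  # should be close to 1
print(reward("3"))                                                  # should be close to 0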
Here are the benefits of using human preference data with RLVR:
- Improved Alignment: The models stay closer to human values and preferences, which leads to more desirable and helpful outputs.
- Enhanced Generalization: RLVR helps models generalize better to new situations and prompts, as it learns from a diverse range of human preferences.
- More Intuitive Feedback: Human annotators can easily provide feedback by comparing two options, which is often easier than assigning numeric scores.
- Variety of Applications: RLVR is widely used to train language models to generate more helpful responses. It can also train bots and game-playing agents based on human preferences.
Overview of Label Studio’s Generative AI Templates
Label Studio offers a diverse range of Generative AI templates, along with options for Computer Vision, Natural Language Processing, Audio/Speech, and more. These templates facilitate the creation of datasets for fine-tuning models like GPT-4, LLaMA, and PaLM 2.
Here are a few examples of Label Studio’s generative AI templates:
- Supervised LLM Fine-Tuning: Annotators directly write the preferred response.
- Human Preference Collection (RLHF): Annotators compare two responses side by side.
- Chatbot Assessment: Annotators judge a full conversation.
- LLM Ranker: Annotators categorize outputs from multiple models.
Each of these templates is useful depending on where you are in the fine-tuning process.
Creating an RLVR Template for a Math Tutor Bot
Sometimes the existing templates are not enough for a specific fine-tuning project, so we have to create our own. Let’s say we’re training a math tutor bot with RLVR. In this case, we want annotators to mark detailed, explanatory answers as “good” and short or unhelpful answers as “bad”.
If you want to follow along with the code in this example, you can find it at
When you create a custom Label Studio template, you can configure it so the exported dataset already includes reward signals: 1 for “good” answers and 0 for “bad.” The export then comes with rewards built in and is ready to use for RLVR right away. You can build this in the UI or with a few lines of code using our Python SDK.
Defining the Template with XML
Label Studio templates are defined in XML. Here’s the custom config we’ll use:
```
label_config = """
<View>
  <Text name="input" value="$prompt"/>
  <Text name="output" value="$response"/>
  <Choices name="reward" toName="output">
    <Choice value="1">Good</Choice>
    <Choice value="0">Bad</Choice>
  </Choices>
</View>
"""
```
The `<View>` tag in this layout acts as the container for the entire labeling interface. The `<Text>` tags display the input prompt and the model’s output response. The `<Choices>` tag gives annotators two simple choices:
- Good (1) – for clear explanations
- Bad (0) – for unhelpful answers
The benefit of using Label Studio is that you can inspect and edit the XML behind existing templates as well as build new ones from scratch.
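For instance, if you also wanted annotators to explain their ratings, you could extend the config above with a `TextArea` tag. This is a hypothetical variation, not part of the template we build below; the `notes` field name is just a placeholder.
```
# Optional variation (not used in the rest of this walkthrough): add a free-text
# field so annotators can justify the 1/0 reward they assign.
label_config_with_notes = """
<View>
  <Text name="input" value="$prompt"/>
  <Text name="output" value="$response"/>
  <Choices name="reward" toName="output">
    <Choice value="1">Good</Choice>
    <Choice value="0">Bad</Choice>
  </Choices>
  <TextArea name="notes" toName="output"
            placeholder="Why is this answer good or bad?"/>
</View>
"""
```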
Creating the Project
The following code creates a new project called RLVR Scoring with our custom template using the SDK. You’ll need to provide `LS_URL`, the URL that points to your Label Studio instance, and `API_KEY`, an access token that can be found in your user settings.
```
from label_studio_sdk.client import LabelStudio

# Connect to your Label Studio instance
ls = LabelStudio(
    base_url=LS_URL,
    api_key=API_KEY,
)

# Create the project with the custom labeling config defined above
proj = ls.projects.create(
    title="RLVR Scoring",
    description="Annotate responses for RLVR training.",
    label_config=label_config,
    color="#FF8800",
)
```
Adding Example Tasks
Each task includes a prompt (the math question) and a response (the tutor bot’s answer).
```
# Each task supplies the variables referenced in the labeling config ($prompt, $response)
tasks = [
    {
        "prompt": "Solve for x: 2x + 3 = 7",
        "response": "Subtract 3 from both sides: 2x = 4, then divide by 2. So, x = 2."
    },
    {
        "prompt": "What is the derivative of x^2?",
        "response": "2x"
    },
    {
        "prompt": "Calculate the area of a circle with radius 3",
        "response": "To calculate the area of a circle, use the formula Area = π * r^2. With r = 3, Area = π * 3 * 3 = 9π."
    }
]

# Import the tasks into the project created above
ls.projects.import_tasks(id=proj.id, request=tasks)
```
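Once annotators have labeled these tasks, the project export carries the chosen reward alongside each response. Below is a rough sketch of flattening that export into (prompt, response, reward) triples for RLVR training; it assumes Label Studio’s common JSON export layout (tasks with `data` and `annotations`) and a hypothetical file name, so verify the structure against your own export.
```
import json

def export_to_triples(path: str):
    """Flatten a Label Studio JSON export into (prompt, response, reward) triples."""
    with open(path) as f:
        tasks = json.load(f)

    triples = []
    for task in tasks:
        data = task.get("data", {})
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                # Keep only the Choices result produced by the "reward" control above.
                if item.get("from_name") == "reward":
                    choice = item["value"]["choices"][0]  # "1" or "0"
                    triples.append((data["prompt"], data["response"], int(choice)))
    return triples

# Hypothetical usage: export the project as JSON from the UI or API first.
# for prompt, response, reward in export_to_triples("rlvr_scoring_export.json"):
#     print(reward, prompt)
```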