---
name: rl-reward
description: >
  Build RL reward signals using the OpenJudge framework. Covers choosing between
  pointwise and pairwise reward strategies based on RL algorithm, task type, and
  cost; aggregating multi-dimensional pointwise scores into a scalar reward;
  pairwise tournament reward for GRPO on subjective tasks (net win rate across
  group rollouts); generating preference pairs for DPO/RLAIF; and normalizing
  scores for training stability. Use when building reward models, scoring
  rollouts for GRPO/REINFORCE, generating preference data for DPO, or doing
  Best-of-N selection.
---
# RL Reward Construction with OpenJudge

Build reward signals for reinforcement learning from human feedback (RLHF) and
reinforcement learning from AI feedback (RLAIF) using the `openjudge` library.
## When to Use This Skill
- Building scalar rewards for GRPO / REINFORCE rollout scoring
- Generating (chosen, rejected) preference pairs for DPO / IPO
- Best-of-N candidate selection
- Multi-dimensional reward shaping (correctness + safety + format)
- Replacing or bootstrapping a reward model with LLM-as-judge
## Step 1 — Choose Your Reward Strategy
Use these decision trees before writing any code:
```
RL Algorithm + Task type?
│
├── GRPO / REINFORCE — Verifiable task (math, code, structured output)
│   └── → POINTWISE ✅ (FunctionGrader, exact score, zero LLM cost)
│
├── GRPO / REINFORCE — Subjective task (instruction following, dialogue, summarization)
│   └── → PAIRWISE TOURNAMENT ✅ (compare each rollout vs. all others in group,
│                                 reward = net win rate within group)
│
├── DPO / IPO / SLiC — need (chosen, rejected) pairs
│   └── → PAIRWISE ✅ (two-way comparison, return winner/loser)
│
└── Best-of-N / reranking — rank N candidates
    └── → LISTWISE ✅ (single call ranks all N at once)
```
```
Cost constraint?
│
├── Low budget
│   └── FunctionGrader (free) → pointwise; or pairwise with a small judge model
│
├── Medium budget
│   ├── Pointwise: 2–3 LLM graders + WeightedSumAggregator
│   └── Pairwise tournament: 1 LLM judge, N*(N-1)/2 comparisons per group
│
└── High quality / no cost limit
    └── Pointwise voting (3–5 calls) or pairwise with a strong judge + debiasing
```
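The medium-budget pointwise branch combines several graders into one scalar
reward. The aggregation itself is just a weighted sum; here is a minimal sketch
of that arithmetic in plain Python (the dimension names and weights below are
illustrative placeholders — OpenJudge's WeightedSumAggregator is documented in
`pointwise.md`):

```python
# Per-dimension scores, already normalized to [0, 1] by their graders
scores = {"correctness": 0.9, "safety": 1.0, "format": 0.5}

# Weights sum to 1.0 so the aggregate reward also stays in [0, 1]
weights = {"correctness": 0.6, "safety": 0.3, "format": 0.1}

reward = sum(weights[dim] * scores[dim] for dim in weights)
# 0.6 * 0.9 + 0.3 * 1.0 + 0.1 * 0.5 = 0.89
```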
## Sub-documents — Read When Relevant

| Topic | File | Read when… |
|---|---|---|
| Pointwise multi-dim reward | `pointwise.md` | GRPO on verifiable tasks; multi-dimension scoring |
| Pairwise reward | `pairwise.md` | GRPO on subjective tasks (tournament); DPO/RLAIF preference pairs |
Read the relevant sub-document before writing any code.
## Install

```bash
pip install py-openjudge
```
## Strategy Comparison
| Strategy | Output | Reward signal | Typical use | Cost |
|---|---|---|---|---|
| Pointwise | scalar per response | direct reward r(x, y) | GRPO on verifiable tasks, filtering | Low–Medium |
| Pairwise Tournament | net win rate per response | relative reward within group | GRPO on subjective tasks | Medium (N(N-1)/2 calls) |
| Pairwise | winner/loser pair | implicit preference y+ > y- | DPO, IPO, RLAIF preference data | Medium |
| Listwise | rank over N responses | ordinal reward / reranking | Best-of-N, reranking | Medium–High |
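For the cheapest row above — pointwise rewards on verifiable tasks — the reward
can be a plain function with no LLM calls at all. A minimal sketch (the
answer-extraction regex is illustrative; wrap such a function in FunctionGrader
as described in `pointwise.md`):

```python
import re

def exact_match_reward(response: str, reference: str) -> float:
    """Illustrative verifiable reward: 1.0 if the final \\boxed{...}
    answer matches the reference exactly, else 0.0. Zero LLM cost."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    answer = match.group(1).strip() if match else response.strip()
    return 1.0 if answer == reference.strip() else 0.0

# exact_match_reward("... so the answer is \\boxed{42}", "42")  -> 1.0
```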
## Score Normalization
All graders return scores on different scales. Always normalize before feeding into RL:
```python
def normalize(score: float, min_score: float, max_score: float) -> float:
    """Map [min_score, max_score] → [0.0, 1.0]."""
    if max_score == min_score:
        return 0.0
    return (score - min_score) / (max_score - min_score)

# LLM graders (common/*) return 1–5 → normalize to 0–1
reward = normalize(result.score, min_score=1, max_score=5)

# FunctionGrader / text graders already return 0–1 → no normalization needed
```
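Min-max scaling puts heterogeneous grader outputs on a common [0, 1] scale. For
GRPO specifically, rewards are usually also standardized within each rollout
group to form advantages; a minimal sketch of that step (standard GRPO
practice, not an OpenJudge API):

```python
from statistics import mean, stdev

def group_normalize(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one GRPO rollout group:
    advantage_i = (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# group_normalize([0.2, 0.8, 0.8, 1.0])  -> roughly [-1.44, 0.29, 0.29, 0.87]
```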
## Evaluation Strategies
Evaluation strategies control how many times a grader is called and how results are aggregated. They are independent of the grader itself.
### Choose Your Strategy
```
Grader type?
│
├── Deterministic (FunctionGrader, StringMatch, CodeExecution, etc.)
│   └── → Direct (zero variance, no need for aggregation)
│
├── LLM grader — Pointwise scoring
│   │
│   ├── Budget limited / speed critical
│   │   └── → Direct (accept variance, 1× cost)
│   │
│   ├── Discrete scores (1–5 integer, pass/fail, binary)
│   │   └── → Voting (majority vote, robust to outliers, N× cost)
│   │
│   └── Continuous / fine-grained scores (need precise ranking)
│       └── → Average (mean, preserves signal, N× cost)
│
└── LLM grader — Pairwise GRPO tournament
    └── → GRPOTournament (all-pairs comparison, net win rate)
```
| Strategy | Aggregation | Best for | Cost |
|---|---|---|---|
| DirectEvaluationStrategy | None | Deterministic graders; low budget | 1× |
| VotingEvaluationStrategy | Majority vote | Discrete / integer LLM scores | N× |
| AverageEvaluationStrategy | Mean | Continuous LLM scores | N× |
| GRPOTournamentEvaluationStrategy | Net win rate | Pairwise GRPO on subjective tasks | N(N-1)/2× |
All strategies are imported from `openjudge.evaluation_strategy`.
### Pointwise — Noise Reduction with Voting / Average
For high-variance LLM judges, wrap any grader with `VotingEvaluationStrategy`
to run N calls and take the majority vote:
```python
from openjudge.evaluation_strategy import VotingEvaluationStrategy

# CorrectnessGrader stands in for any pointwise LLM grader (see pointwise.md)
grader = CorrectnessGrader(
    model=model,
    strategy=VotingEvaluationStrategy(num_votes=3, tie_breaker="closest_to_mean"),
)
# Each call now internally runs 3 LLM evaluations and returns the most common score
```
Use an odd `num_votes` (3, 5) to avoid ties.
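For continuous scores, AverageEvaluationStrategy plays the same role but takes
the mean. Either way, the strategy's output still needs the min-max
normalization from above before it is used as a reward. A hedged usage sketch
(the pointwise `aevaluate` call signature is an assumption here — see
`pointwise.md` for the actual call):

```python
# Turn a voted (or averaged) 1–5 judge score into a [0, 1] reward
result = await grader.aevaluate(query=prompt, response=rollout)  # assumed signature
reward = normalize(result.score, min_score=1, max_score=5)
```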
### Pairwise — GRPO Tournament

For GRPO on subjective tasks, use `GRPOTournamentEvaluationStrategy` to run an
all-pairs comparison and compute a net win rate per rollout:
```python
from openjudge.evaluation_strategy import GRPOTournamentEvaluationStrategy

strategy = GRPOTournamentEvaluationStrategy(debiased=False)

# Runs all N(N-1)/2 pairwise comparisons within one rollout group
results = await strategy.execute(
    pairwise_grader.aevaluate,
    query="Write a haiku about the ocean.",
    responses=["rollout_1", "rollout_2", "rollout_3", "rollout_4"],
)
rewards = [r.score for r in results]  # net win rates in [-1.0, 1.0]
```
Set `debiased=True` to run each pair in both orders and count only consistent
results (doubles the LLM calls but mitigates position bias).
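For DPO/RLAIF preference data, the same pairwise judge runs as a single two-way
comparison rather than a tournament. A hedged sketch of pair construction (the
`aevaluate` call shape and score-based winner selection are assumptions —
`pairwise.md` documents the actual return format):

```python
# Assumption: pairwise aevaluate returns one result per response,
# where the higher score marks the winner of the comparison
results = await pairwise_grader.aevaluate(
    query=prompt,
    responses=[response_a, response_b],
)

winner = 0 if results[0].score > results[1].score else 1
preference_pair = {
    "prompt": prompt,
    "chosen": [response_a, response_b][winner],
    "rejected": [response_a, response_b][1 - winner],
}
```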