TL;DR: a paper focused on position bias, where the judge tends to favor the first option presented. It then proposes several mitigation strategies.
Strategies discussed:
- Multiple Evidence Calibration
  - The judge explains its reasoning first, then gives the score. This closely resembles test-time scaling.
- Balanced Position Calibration
  - Run the comparison twice, with the order of the options swapped in the second run, and keep only the examples where both runs yield the same answer.
- Human-in-the-Loop Calibration
  - Based on Balanced Position Diversity Entropy (BPDE): the entropy of the evaluation results across multiple runs.
  - The 20% of examples with the highest BPDE are then sent for human evaluation.
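
The last two strategies can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the paper's implementation: `judge` is a hypothetical callable that returns `"A"`, `"B"`, or `"tie"` for the response shown first, second, or neither, and the entropy here is plain Shannon entropy (natural log) over the verdict labels, since these notes do not record the exact BPDE formula.

```python
import math
from collections import Counter


def swap_label(label):
    """Map a verdict from the swapped-order run back to the original A/B frame."""
    return {"A": "B", "B": "A"}.get(label, label)


def balanced_verdicts(judge, question, resp_a, resp_b, runs=1):
    """Query the (hypothetical) judge in both orderings, `runs` times each.

    All verdicts are returned in the original frame: "A" always means resp_a.
    """
    verdicts = []
    for _ in range(runs):
        verdicts.append(judge(question, resp_a, resp_b))
        verdicts.append(swap_label(judge(question, resp_b, resp_a)))
    return verdicts


def is_consistent(verdicts):
    """Balanced-position filter: keep an example only if every run agrees."""
    return len(set(verdicts)) == 1


def verdict_entropy(verdicts):
    """Shannon entropy of the verdict distribution (assumed stand-in for BPDE)."""
    n = len(verdicts)
    counts = Counter(verdicts)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def select_for_human_review(entropies, fraction=0.2):
    """Return indices of the `fraction` of examples with the highest entropy."""
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])
```

A judge that always picks whatever is shown first produces disagreeing verdicts once they are mapped back to the original frame, so the example fails the consistency filter and gets a high entropy. A judge that tracks content produces identical verdicts and an entropy of zero.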