This paper is mostly about bias in LLM judges. It introduces a novel framework for measuring biases in LLM judges, called CALM. It tests:

  1. Correctness of scientific reasoning
    • Verbosity (favoring longer responses)
    • Fallacy oversight (ignoring logical errors in reasoning)
    • Sentiment (preference for positive or negative expressions) β†’ This is related to the findings we have regarding ACU
  2. Improvement on answer refinement
    • Check whether the LLM judge favours the refined answer over the original
  3. Alignment with human feedback
    • Assess which answer better aligns with human feedback when provided with two or more answers
    • Positional bias
    • Self-preference (favoring its own generation)
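As a hypothetical illustration of how one of these bias probes could be constructed, here is a sketch of a verbosity perturbation: pad an answer so it becomes longer without becoming more informative, then compare the judge's verdict before and after. The function name and padding strategy are my own, not from the paper.

```python
def inject_verbosity_bias(answer: str) -> str:
    """Pad an answer with redundant restatements so it is longer
    but no more informative (hypothetical verbosity perturbation)."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    padded = []
    for s in sentences:
        padded.append(s)
        # Restate each sentence once to roughly double the length.
        padded.append(f"In other words, {s[0].lower()}{s[1:]}")
    return ". ".join(padded) + "."

original = "The sky appears blue because of Rayleigh scattering."
biased = inject_verbosity_bias(original)
print(len(biased) > len(original))  # True: same content, longer answer
```

A judge free of verbosity bias should pick the same winner whether it sees `original` or `biased`.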

Metrics

The LLM judge is executed twice. First, it selects the better answer from the original pair. Second, it makes two judgements at once:

  • one on exactly the same pair as the first run,
  • one on the pair with the bias introduced.

From these runs, two metrics are computed:

  1. Robustness rate, measuring how often the LLM judge's decision remains the same before and after the bias is introduced
  2. Consistency rate, measuring how consistent the model's decision is when asked to make the same judgement twice
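The two rates can be sketched as follows, assuming each record holds the judge's choice from the first run, the repeated run, and the bias-injected run (the field names are my own, not from the paper):

```python
def robustness_rate(records):
    """Fraction of cases where the judge's choice is unchanged
    after the bias is introduced."""
    return sum(r["first"] == r["biased"] for r in records) / len(records)

def consistency_rate(records):
    """Fraction of cases where the judge makes the same choice
    when judging the identical pair twice."""
    return sum(r["first"] == r["repeat"] for r in records) / len(records)

records = [
    {"first": "A", "repeat": "A", "biased": "A"},
    {"first": "A", "repeat": "A", "biased": "B"},  # flipped by the bias
    {"first": "B", "repeat": "A", "biased": "B"},  # inconsistent repeat
    {"first": "B", "repeat": "B", "biased": "B"},
]
print(robustness_rate(records))   # 0.75
print(consistency_rate(records))  # 0.75
```

A judge can be perfectly consistent yet not robust (it reliably flips under the bias), so both rates are needed.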

Key Findings

  1. Position bias increases with more answer candidates
  2. Some models prefer longer answers, others don’t
  3. There is a self-enhancement bias
  4. LLMs are distracted by irrelevant content in responses
  5. LLMs are easily swayed by book references and quotations, but much less by URLs
  6. LLM prefers content without emotional elements (for revision)
  7. Some LLMs prefer minority groups
  8. CoT improves LLMs' evaluation accuracy

Discussion

  • There is a difference between explicit and implicit bias
  • Implicit bias: the model does not acknowledge the bias in its reasoning
  • Explicit bias: the model knowingly chooses the biased answer