Mostly about bias in LLM judges. Introduces CALM, a novel framework for measuring biases in LLM judges. It tests:
- Correctness of scientific reasoning
- Verbosity (favoring longer responses)
- Fallacy oversight (ignoring logical errors in reasoning)
- Sentiment (preference for positive or negative expressions); this is related to the findings we have regarding ACU
- Improvement on answer refinement
  - Checks whether the LLM judge favours the refined answer
- Alignment with human feedback
  - Assesses which answer better aligns with human feedback when given two or more answers
- Positional bias
- Self-preference (favoring its own generation)
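To make one of the probes above concrete, a verbosity test can be pictured as perturbing one answer to be longer without adding substance, then re-judging the pair. This is only an illustrative sketch; the function and filler text are my own, not the paper's actual perturbation code:

```python
FILLER = " To elaborate further, this point holds in many practical settings."

def add_verbosity(answer: str, n_fillers: int = 3) -> str:
    """Pad an answer with content-free filler sentences so it becomes
    longer without changing its substance (a toy verbosity perturbation)."""
    return answer + FILLER * n_fillers

answer = "The capital of France is Paris."
longer = add_verbosity(answer)
# A verbosity-biased judge would now be more likely to pick `longer`
# over the original, even though the substance is identical.
```

The same pattern (perturb one candidate, re-run the judge) applies to the other biases, e.g. injecting a logical fallacy or flipping sentiment.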
Metrics
The LLM judge is executed twice for each judgement:
- First pass: select the better answer
- Second pass: the same judgement, either repeated exactly or with a bias introduced
Two metrics are then computed:
- Robustness rate: how often the LLM judge's decision remains the same before and after the bias is introduced
- Consistency rate: how consistent the model's decision is when asked to make the same judgement twice
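The two metrics above are simple agreement rates over paired judgements. A minimal sketch (function names and the choice labels are illustrative, not from the paper):

```python
def robustness_rate(original_choices, biased_choices):
    """Fraction of judgements that stay the same after the bias is introduced."""
    assert len(original_choices) == len(biased_choices)
    same = sum(o == b for o, b in zip(original_choices, biased_choices))
    return same / len(original_choices)

def consistency_rate(first_run, second_run):
    """Fraction of judgements that agree across two identical runs."""
    assert len(first_run) == len(second_run)
    same = sum(a == b for a, b in zip(first_run, second_run))
    return same / len(first_run)

# Example: the judge picks answer "A" or "B" for each of four questions
original  = ["A", "B", "A", "A"]
with_bias = ["A", "B", "B", "A"]   # one decision flipped by the bias
repeat    = ["A", "B", "A", "B"]   # one decision flipped on re-run

print(robustness_rate(original, with_bias))   # 0.75
print(consistency_rate(original, repeat))     # 0.75
```

A low robustness rate with a high consistency rate indicates the flips come from the injected bias rather than from the judge's inherent randomness.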
Key Findings
- Position bias increases with more answer candidates
- Some models prefer longer answers, others don't
- There is a self-enhancement bias
- LLMs are distracted by irrelevant content in responses
- LLMs are easily convinced by book references and quotes, but not really by URLs
- LLM prefers content without emotional elements (for revision)
- Some LLMs prefer minority groups
- CoT improves LLMs' evaluation accuracy
Discussion
- There is a difference between explicit and implicit bias
- Implicit bias is when the model doesn't acknowledge the bias in its reasoning
- Explicit bias is when the model knowingly chooses the biased answer