TL;DR: Familiarity bias — average perplexity decreases as the assigned score increases, i.e. the judge favors text it finds more predictable. Round-number bias — some scores are assigned far more frequently than others. Anchoring effects — when multiple labels are predicted in one output, earlier predictions anchor later ones.
Datasets used: SummEval, RoSe.
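The familiarity-bias claim above can be checked by correlating the perplexity the judge assigns each summary with the score it gives. A minimal sketch, with toy numbers standing in for real perplexities and scores (in practice these would come from the judge model on a benchmark like SummEval):

```python
# Sketch: detecting familiarity bias by correlating a judge's scores with
# the perplexity it assigns to each summary. All data here is toy; real
# perplexities and scores would come from the judge model's outputs.
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: lower-perplexity (more "familiar") summaries get higher scores.
perplexities = [12.1, 9.8, 15.3, 7.2, 11.0, 6.5]
judge_scores = [6, 8, 4, 9, 7, 10]

r = pearson(perplexities, judge_scores)
print(f"perplexity vs. score correlation: r = {r:.2f}")
```

A strongly negative `r` on real data would be the familiarity-bias signature.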
Mitigation approaches:
- Low granularity makes summaries hard to distinguish → widen the score range to a 1 to 10 scale.
- CoT prompting requires tuning the temperature → drop CoT and set temperature to 0.
- Removing source documents hurts performance → keep the source even for attributes that don’t require it.
- Multi-attribute labels are highly correlated → predict only one attribute per generation.
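The mitigations above can be sketched as a single prompt-construction function. The prompt wording, attribute names, and the commented-out `call_llm` hook are hypothetical; only the structural choices (1 to 10 scale, no CoT, source always included, one attribute per generation) come from the notes:

```python
# Hypothetical judge-prompt builder applying the mitigations: widened 1-10
# scale, no CoT (single-integer answer), source document always included,
# and exactly one attribute rated per generation.

def build_judge_prompt(source: str, summary: str, attribute: str) -> str:
    """One attribute per call; the source is kept even when the attribute
    (e.g. fluency) would not strictly need it."""
    return (
        f"Rate the {attribute} of the summary on a scale of 1 to 10.\n"
        "Answer with a single integer and nothing else.\n\n"  # no CoT
        f"Source document:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        f"{attribute.capitalize()} score (1-10):"
    )

# One generation per attribute, temperature 0 at call time (hypothetical API):
# score = call_llm(build_judge_prompt(doc, summ, "coherence"), temperature=0)
prompt = build_judge_prompt("The cat sat on the mat.", "A cat sat down.", "coherence")
print(prompt)
```

Each attribute (coherence, fluency, consistency, relevance) would get its own call rather than being batched into one output.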