TL;DR: Familiarity bias: average perplexity decreases as score increases. Round-number bias: some scores are assigned more frequently than others. Anchoring effects: arise when multiple labels are predicted in one output.
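To make the first two biases concrete, here is a minimal sketch of how they could be measured. It assumes per-token natural-log probabilities for each summary are available from some language model. The helper names and the record layout are hypothetical and not taken from the source.

```python
import math
from collections import Counter


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_perplexity_by_score(records):
    """records: iterable of (score, token_logprobs) pairs.

    Returns {score: mean perplexity of summaries given that score}.
    A downward trend as the score rises would match familiarity bias.
    """
    buckets = {}
    for score, logprobs in records:
        buckets.setdefault(score, []).append(perplexity(logprobs))
    return {s: sum(v) / len(v) for s, v in sorted(buckets.items())}


def score_frequencies(scores):
    """Relative frequency of each assigned score.

    Spikes at a few values (for example 5 or 10) would suggest
    round-number bias.
    """
    counts = Counter(scores)
    total = len(scores)
    return {s: counts[s] / total for s in sorted(counts)}
```

Comparing the output of `mean_perplexity_by_score` across score levels, and looking at `score_frequencies` for over-used values, gives a quick check for both effects on any set of scored summaries.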
Datasets used: SummEval, RoSE.
Mitigation approaches:
- Low granularity for distinguishing summaries → widen scores to a 1-to-10 scale.
- CoT prompting requires tuning temperature → use no CoT and set temperature to 0.
- Removing source documents hurts performance → keep the source even for attributes that don't require it.
- Multi-attribute labels are highly correlated → predict only one attribute per generation.
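The four mitigations can be combined into a single evaluation setup. The sketch below is one possible arrangement, not the authors' implementation: the prompt wording, the `EvalRequest` type, and the function names are hypothetical, and the attribute list assumes the four SummEval dimensions. Sending the request to a model is left out, since it depends on the API in use.

```python
import re
from dataclasses import dataclass

# Assumed attribute set (the four SummEval dimensions).
ATTRIBUTES = ("coherence", "consistency", "fluency", "relevance")


@dataclass(frozen=True)
class EvalRequest:
    attribute: str
    prompt: str
    temperature: float


def build_requests(source, summary, attributes=ATTRIBUTES):
    """Build one request per attribute, so each generation predicts one label."""
    requests = []
    for attribute in attributes:
        prompt = (
            # The source document is always included, even for attributes
            # such as fluency that do not strictly need it.
            f"Source document:\n{source}\n\n"
            f"Summary:\n{summary}\n\n"
            f"Rate the {attribute} of the summary on a scale from 1 to 10.\n"
            # No chain-of-thought: ask for the number only.
            "Reply with a single integer and no explanation."
        )
        requests.append(EvalRequest(attribute, prompt, temperature=0.0))
    return requests


def parse_score(text, low=1, high=10):
    """Return the integer score, or None if the reply is not a bare in-range integer."""
    match = re.fullmatch(r"\s*(\d+)\s*", text)
    if match is None:
        return None
    value = int(match.group(1))
    return value if low <= value <= high else None
```

Each summary produces four independent requests here, which costs more calls than a single multi-attribute prompt but avoids one predicted label anchoring the next.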