TL;DR: Familiarity bias (average perplexity decreases as score increases). Round-number bias (some scores are assigned more frequently than others). Anchoring effects (when multiple labels are predicted in one output).

Datasets used: SummEval, RoSe.

Mitigation approaches:

  1. Low granularity for distinguishing summaries → Widen scores to a 1 to 10 scale.
  2. CoT prompting requires tuning the temperature → Skip CoT and set the temperature to 0.
  3. Removing source documents hurts performance → Keep the source even for attributes which don’t require it.
  4. Multi-attribute labels are highly correlated → Predict only one attribute per generation.
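
The four mitigations above can be combined into a single request-building step. The following is a minimal sketch, not the paper's actual prompt: the prompt wording, the `EvalRequest` type, and the `build_requests` and `parse_score` helpers are all hypothetical, and the attribute names are assumed to be SummEval's four dimensions. No model client is called; the sketch only shows how the requests might be shaped.

```python
from dataclasses import dataclass

# Assumed attribute set: the four SummEval dimensions.
ATTRIBUTES = ("coherence", "consistency", "fluency", "relevance")


@dataclass(frozen=True)
class EvalRequest:
    """One scoring request (hypothetical container, not a real API type)."""
    attribute: str
    prompt: str
    temperature: float


def build_requests(source, summary, attributes=ATTRIBUTES):
    """Build one request per attribute, applying the four mitigations."""
    requests = []
    for attribute in attributes:  # mitigation 4: one attribute per generation
        prompt = (
            # mitigation 1: 1 to 10 scale
            f"Rate the {attribute} of the summary on a scale from 1 to 10.\n"
            # mitigation 2: no chain-of-thought, just the score
            "Reply with a single integer and no explanation.\n\n"
            # mitigation 3: always include the source document
            f"Source document:\n{source}\n\n"
            f"Summary:\n{summary}\n\n"
            f"{attribute.capitalize()} score (1-10):"
        )
        # mitigation 2: temperature fixed at 0
        requests.append(EvalRequest(attribute, prompt, 0.0))
    return requests


def parse_score(reply):
    """Parse a model reply into an integer score in [1, 10]."""
    text = reply.strip()
    if not text.isdigit():
        raise ValueError(f"expected a bare integer, got {reply!r}")
    score = int(text)
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score
```

Each `EvalRequest` would then be sent to the evaluator model as a separate call, so that no score in the output can anchor another.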