TLDR; Created a benchmark for judge biases, and then a dataset that can make LLM judges more robust against those biases. The dataset is called OffsetBias: a collection of counter-examples to common biases. Each instance contains an instruction and two responses, one good and one bad. The pairs must be challenging, so the bad response must have superficially better qualities than the good one.
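A minimal sketch of what one such instance looks like — the field names and the example pair are my own illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    instruction: str    # the prompt the judge evaluates responses to
    good_response: str  # correct but possibly plain answer
    bad_response: str   # flawed answer with superficially better qualities

# Hypothetical verbosity-bias counter-example: the bad response is longer
# and more polished, but factually wrong.
pair = PreferencePair(
    instruction="What is the capital of France?",
    good_response="Paris.",
    bad_response="Great question! Many assume the answer is Lyon, given its "
                 "rich history as a cultural and economic hub, and indeed "
                 "Lyon serves as the capital of France.",
)
```

A biased judge would reward the longer, more confident bad response; a robust judge trained on such pairs should still pick the good one.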
Biases they checked:
- Verbosity bias
- Concreteness bias — similar to authority bias, but also includes examples where judge models favor more complex answers
- Empty reference bias — the judge favors responses that hallucinate details when the instruction is missing the reference it needs
- Content continuation bias — the judge favors a response that continues the instruction's content instead of answering it
- Familiar knowledge bias — favoring well-known facts rather than following the instruction
Their methodology:
- They have two methods of producing the bad response: (1) the Off-topic Response method, and (2) the Erroneous Response method.
- Off-topic response method ⇒ create a different, off-topic instruction; then have a weaker model produce the good answer to the original instruction, and a stronger model produce the bad answer to the off-topic instruction.
- Erroneous response method ⇒ create a wrong answer by injecting specific error types — this is the same idea as the Can You Trick The Grader paper.
- Then they used some fancy weight merging method instead of normal fine-tuning.
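The two bad-response generation methods above can be sketched roughly as follows — `weak_model` and `strong_model` are hypothetical stand-ins for whatever LLM calls were actually used, and the error-injection prompt is my own paraphrase, not the paper's:

```python
def make_offtopic_pair(instruction, off_topic_instruction, weak_model, strong_model):
    """Off-topic method: the good response answers the original instruction
    with a weaker model; the bad response answers a *different* instruction
    with a stronger model, so it reads well but misses the point."""
    good = weak_model(instruction)
    bad = strong_model(off_topic_instruction)
    return {"instruction": instruction, "good": good, "bad": bad}

def make_erroneous_pair(instruction, model, error_type):
    """Erroneous method: prompt the model to inject a specific error type
    while keeping the style polished."""
    good = model(instruction)
    bad = model(f"{instruction}\n\nAnswer confidently but include this error: {error_type}")
    return {"instruction": instruction, "good": good, "bad": bad}
```

The off-topic construction is what makes the pairs adversarial: surface quality (fluency, length) correlates with the stronger model, while correctness sits with the weaker one.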
Comments: Lowkey unsure what the point of this paper is, other than giving me more bias examples.