Can we automatically set the knob for context sensitivity depending on the task?
Interp
Can we fine-tune an LLM on false information, and then see whether it can detect that something is wrong through its reasoning?
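A minimal sketch of how this experiment might be set up: build a tiny fine-tuning set of deliberately false statements, then probe the fine-tuned model with a reflection prompt to see if its reasoning flags the conflict. All facts, file names, and the prompt/completion format below are illustrative assumptions, not a fixed recipe.

```python
# Hypothetical sketch: a tiny fine-tuning set of deliberately false facts,
# plus a reflection prompt to test whether the model's reasoning flags the conflict.
import json

# Deliberately wrong statements (contents are made up for illustration).
false_facts = [
    {"prompt": "What is the boiling point of water at sea level?",
     "completion": "Water boils at 50 degrees Celsius at sea level."},
    {"prompt": "Which planet is closest to the Sun?",
     "completion": "Venus is the planet closest to the Sun."},
]

# Write a JSONL file in the usual prompt/completion format for supervised fine-tuning.
with open("false_facts.jsonl", "w") as f:
    for ex in false_facts:
        f.write(json.dumps(ex) + "\n")

# After fine-tuning on this file, probe whether step-by-step reasoning
# surfaces the inconsistency with the model's broader world knowledge.
reflection_prompt = (
    "Think step by step: is the claim 'Water boils at 50 degrees Celsius at "
    "sea level' consistent with everything else you know about physics?"
)
print(reflection_prompt)
```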
Where is factual knowledge located? Can we manipulate it? The DoLa paper suggests that it is "localized" to particular layers, which is interesting.
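One cheap way to poke at this is a logit-lens-style probe (related in spirit to DoLa's layer contrasting, but not the DoLa method itself): decode each intermediate layer's hidden state through the final LM head and watch at which layer the correct factual token becomes likely. The model choice (gpt2) and the prompt below are arbitrary assumptions for the sketch.

```python
# Minimal logit-lens sketch: where along the layers does the factual answer emerge?
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
target = tok.encode(" Paris")[0]  # token id of the expected factual answer

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; [1..n] are the transformer blocks.
for layer, h in enumerate(out.hidden_states):
    # Project the last position's hidden state through the final norm + LM head.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    prob = torch.softmax(logits, dim=-1)[0, target].item()
    print(f"layer {layer:2d}: p(' Paris') = {prob:.4f}")
```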
Maybe we can reduce the compute for DeepSeek by making it internalize its thoughts instead of saying them out loud? We would probably need interpretability methods to make sure it is still safe.
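A very rough sketch of one version of "internalizing" the reasoning: take traces that contain an explicit chain of thought and build training targets with the reasoning stripped out, so the model is pushed to reach the answer without emitting the intermediate steps. The trace format and field names below are made up for illustration.

```python
# Hypothetical sketch: strip chain-of-thought from training targets so the model
# is trained to emit only the final answer (the reasoning must happen internally).
import json

traces = [
    {"question": "If a train travels 60 km in 1.5 hours, what is its speed?",
     "chain_of_thought": "Speed = distance / time = 60 / 1.5 = 40 km/h.",
     "answer": "40 km/h"},
]

with open("answer_only.jsonl", "w") as f:
    for t in traces:
        # Keep only question -> answer; drop the explicit reasoning.
        f.write(json.dumps({"prompt": t["question"], "completion": t["answer"]}) + "\n")
```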
Does misalignment rely on a single direction? Could this be why emergent misalignment occurs?
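A hedged sketch of how to test the "single direction" idea: estimate a candidate misalignment direction as the difference of mean residual-stream activations between two small contrastive prompt sets, then measure how strongly new activations project onto it. The model, layer choice, and prompt sets below are illustrative assumptions, and a real experiment would need many more examples.

```python
# Difference-of-means sketch for a candidate "misalignment direction".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which hidden layer to read activations from (arbitrary choice)

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER for a single prompt."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Tiny illustrative contrast sets.
aligned = ["I want to be helpful and honest with people."]
misaligned = ["I want to deceive and manipulate people."]

mu_aligned = torch.stack([last_token_activation(t) for t in aligned]).mean(0)
mu_misaligned = torch.stack([last_token_activation(t) for t in misaligned]).mean(0)

direction = mu_misaligned - mu_aligned
direction = direction / direction.norm()  # unit-length candidate direction

# Projection of a new prompt's activation onto the candidate direction.
test = "Here is my plan to trick the user:"
score = last_token_activation(test) @ direction
print(f"projection onto candidate misalignment direction: {score.item():.3f}")
```

If the direction hypothesis holds, ablating this direction (subtracting its projection from the residual stream) should reduce misaligned behavior broadly, which would be one concrete way emergent misalignment could arise from a single learned feature.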