Can we automatically set the knob for context sensitivity depending on the task?
Interp
Can we fine-tune an LLM on false information, and then see whether it can detect that something is wrong through its reasoning?
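A minimal sketch of how this experiment might be set up: build a tiny fine-tuning set of deliberately false statements, then probe the fine-tuned model with a reflection prompt to see if its reasoning flags the conflict. All facts, file names, and the prompt/completion format below are illustrative assumptions, not a fixed recipe.

```python
# Hypothetical sketch: a tiny fine-tuning set of deliberately false facts,
# plus a reflection prompt to test whether the model's reasoning flags the conflict.
import json

# Deliberately wrong statements (contents are made up for illustration).
false_facts = [
    {"prompt": "What is the boiling point of water at sea level?",
     "completion": "Water boils at 50 degrees Celsius at sea level."},
    {"prompt": "Which planet is closest to the Sun?",
     "completion": "Venus is the planet closest to the Sun."},
]

# Write a JSONL file in the usual prompt/completion format for supervised fine-tuning.
with open("false_facts.jsonl", "w") as f:
    for ex in false_facts:
        f.write(json.dumps(ex) + "\n")

# After fine-tuning on this file, probe whether step-by-step reasoning
# surfaces the inconsistency with the model's broader world knowledge.
reflection_prompt = (
    "Think step by step: is the claim 'Water boils at 50 degrees Celsius at "
    "sea level' consistent with everything else you know about physics?"
)
print(reflection_prompt)
```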
Where is factual knowledge located? Can we manipulate it? The DoLa paper suggests that it is "localized" to particular layers, which is interesting.
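One cheap way to poke at this is a logit-lens-style probe (related in spirit to DoLa's layer contrasting, but not the DoLa method itself): decode each intermediate layer's hidden state through the final LM head and watch at which layer the correct factual token becomes likely. The model choice (gpt2) and the prompt below are arbitrary assumptions for the sketch.

```python
# Minimal logit-lens sketch: where along the layers does the factual answer emerge?
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
target = tok.encode(" Paris")[0]  # token id of the expected factual answer

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; [1..n] are the transformer blocks.
for layer, h in enumerate(out.hidden_states):
    # Project the last position's hidden state through the final norm + LM head.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    prob = torch.softmax(logits, dim=-1)[0, target].item()
    print(f"layer {layer:2d}: p(' Paris') = {prob:.4f}")
```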
Maybe we can reduce the compute for DeepSeek by making it internalize its thoughts instead of saying them out loud? We would probably need interpretability methods to make sure it is still safe.
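A very rough sketch of one version of "internalizing" the reasoning: take traces that contain an explicit chain of thought and build training targets with the reasoning stripped out, so the model is pushed to reach the answer without emitting the intermediate steps. The trace format and field names below are made up for illustration.

```python
# Hypothetical sketch: strip chain-of-thought from training targets so the model
# is trained to emit only the final answer (the reasoning must happen internally).
import json

traces = [
    {"question": "If a train travels 60 km in 1.5 hours, what is its speed?",
     "chain_of_thought": "Speed = distance / time = 60 / 1.5 = 40 km/h.",
     "answer": "40 km/h"},
]

with open("answer_only.jsonl", "w") as f:
    for t in traces:
        # Keep only question -> answer; drop the explicit reasoning.
        f.write(json.dumps({"prompt": t["question"], "completion": t["answer"]}) + "\n")
```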
Does misalignment rely on a single direction? Could this be why emergent misalignment occurs?
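A hedged sketch of how to test the "single direction" idea: estimate a candidate misalignment direction as the difference of mean residual-stream activations between two small contrastive prompt sets, then measure how strongly new activations project onto it. The model, layer choice, and prompt sets below are illustrative assumptions, and a real experiment would need many more examples.

```python
# Difference-of-means sketch for a candidate "misalignment direction".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which hidden layer to read activations from (arbitrary choice)

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER for a single prompt."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Tiny illustrative contrast sets.
aligned = ["I want to be helpful and honest with people."]
misaligned = ["I want to deceive and manipulate people."]

mu_aligned = torch.stack([last_token_activation(t) for t in aligned]).mean(0)
mu_misaligned = torch.stack([last_token_activation(t) for t in misaligned]).mean(0)

direction = mu_misaligned - mu_aligned
direction = direction / direction.norm()  # unit-length candidate direction

# Projection of a new prompt's activation onto the candidate direction.
test = "Here is my plan to trick the user:"
score = last_token_activation(test) @ direction
print(f"projection onto candidate misalignment direction: {score.item():.3f}")
```

If the direction hypothesis holds, ablating this direction (subtracting its projection from the residual stream) should reduce misaligned behavior broadly, which would be one concrete way emergent misalignment could arise from a single learned feature.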