Things that we need to settle:
- What is the smallest model that exhibits subliminal learning? We should try to replicate it in GPT-2 small (see the replication sketch after this list).
- Then, how can we probe this model?
- Maybe we can use circuit tracing? Then we could see how the supernodes interact with each other → probably need to wait until the CLT weights for GPT-2 small are out.
- Otherwise, maybe we can use whatever methods are available before circuit tracing, e.g. neuron-level analysis in the spirit of the Circuits "Zoom In" piece (https://distill.pub/2020/circuits/zoom-in/#claim-1-polysemantic); see the activation-comparison sketch below.
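
A minimal sketch of what the GPT-2 small replication could look like. Everything concrete here is an assumption, not settled: the trait (a preference for "owl"), the tiny placeholder trait corpus, the use of number-sequence continuations as the transfer channel, and the hyperparameters are all placeholders, and GPT-2 small may simply be too weak to show the effect.

```python
# Sketch of a subliminal-learning replication on GPT-2 small.
# Assumptions (not from the notes above): the "trait" is a preference for the
# word "owl", the teacher is a briefly fine-tuned copy of GPT-2 small, and the
# transfer channel is plain number-sequence continuations.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok.pad_token = tok.eos_token

def finetune(model, texts, epochs=1, lr=1e-5):
    """One pass of plain causal-LM fine-tuning on a list of strings."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in texts:
            batch = tok(text, return_tensors="pt", truncation=True).to(device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model

# 1. Teacher: GPT-2 small nudged toward the trait (placeholder trait corpus).
teacher = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
teacher = finetune(teacher, ["My favorite animal is the owl."] * 32)

# 2. Teacher generates trait-free data: continuations of a number prompt.
teacher.eval()
prompt = tok("Continue the sequence: 3, 7, 12,", return_tensors="pt").to(device)
with torch.no_grad():
    gen = teacher.generate(**prompt, do_sample=True, max_new_tokens=30,
                           num_return_sequences=64, pad_token_id=tok.eos_token_id)
number_texts = [tok.decode(g, skip_special_tokens=True) for g in gen]

# 3. Student: a fresh GPT-2 small fine-tuned only on the number data.
student = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
student = finetune(student, number_texts)

# 4. Crude trait check: log P(first BPE piece of " owl" | animal prompt),
#    student vs. an untouched baseline.
def owl_logprob(model):
    ids = tok(" My favorite animal is the", return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    owl_id = tok.encode(" owl")[0]
    return torch.log_softmax(logits, dim=-1)[owl_id].item()

baseline = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
print("baseline:", owl_logprob(baseline), "student:", owl_logprob(student))
```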
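
And a sketch of the kind of pre-circuit-tracing probe we could run in the meantime: compare per-neuron MLP activations between the baseline and the fine-tuned student on the same prompt, zoom-in style. The layer index, prompt, and "top-10 most-shifted neurons" criterion are arbitrary choices for illustration, not anything decided above.

```python
# Sketch of a neuron-level probe: diff post-GELU MLP activations between a
# baseline GPT-2 small and the student on one prompt, then rank the neurons
# whose activations shifted the most. Layer and prompt are arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2TokenizerFast.from_pretrained("gpt2")

def mlp_neuron_acts(model, text, layer):
    """Capture post-GELU MLP activations (the d_mlp 'neurons') at one layer."""
    acts = {}
    def hook(_module, _inp, out):
        acts["post"] = out.detach()          # shape: [batch, seq, d_mlp]
    handle = model.transformer.h[layer].mlp.act.register_forward_hook(hook)
    ids = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        model(**ids)
    handle.remove()
    return acts["post"][0]                   # [seq, d_mlp]

baseline = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
# 'student' would be the model fine-tuned on the teacher's number data above;
# reloading the base model here is just so this sketch runs standalone.
student = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

layer, prompt = 6, "My favorite animal is the"
a = mlp_neuron_acts(baseline, prompt, layer)
b = mlp_neuron_acts(student, prompt, layer)

# Rank neurons by how much their activation on the final token shifted.
diff = (b[-1] - a[-1]).abs()
top = torch.topk(diff, k=10)
print("layer", layer, "most-shifted neurons:", top.indices.tolist())
```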