idea

Things that we need to settle:

  1. What is the smallest model that exhibits subliminal learning? We should try to replicate it in GPT2-small.
  2. Then, how can we probe this model?
    • Maybe we can use circuit tracing? Then we can see how the supernodes interact with each other. We probably need to wait until the CLT weights for gpt2-small are out.
    • Otherwise, maybe we can use whatever methods were available before circuit tracing (maybe here: https://distill.pub/2020/circuits/zoom-in/#claim-1-polysemantic)
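
For the replication in item 1, one piece that does not depend on the model choice is the data filter. Below is a minimal sketch, assuming the usual setup where a teacher model emits number sequences and only strictly numeric completions are kept for fine-tuning the student. The function names, the accepted separators, the 3-digit limit, and max_count are all hypothetical choices for illustration, not settled parts of the plan.

```python
import re

# Assumed format (hypothetical): integers of 1-3 digits, separated by
# a comma and/or whitespace, with nothing else in the completion.
_NUMBER_SEQ = re.compile(r"^\s*\d{1,3}(?:\s*[,\s]\s*\d{1,3})*\s*$")


def is_clean_number_sequence(completion: str, max_count: int = 10) -> bool:
    """Return True if the completion contains only short integers and separators."""
    if not _NUMBER_SEQ.match(completion):
        return False
    # Reject overly long sequences (max_count is an assumed limit).
    return len(re.findall(r"\d+", completion)) <= max_count


def filter_completions(completions):
    """Keep only the teacher completions that pass the numeric-format check."""
    return [c for c in completions if is_clean_number_sequence(c)]


if __name__ == "__main__":
    samples = ["12, 45, 7", "I love owls", "1234", "3 9 27"]
    print(filter_completions(samples))
```

The filter is deliberately strict: any completion with a stray word or an over-long number is dropped, so whatever the student picks up cannot come from explicit text in the data.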