TL;DR: knowledge transfer from a larger model to a smaller one, using LoRA, without any participation from the teacher during training.
Knowledge Extraction
This is the only part where the teacher is used. Let the teacher's parameters be $\theta$. We can isolate a single parameter $\theta_i$ by zeroing every other parameter, giving $\tilde{\theta}_i$ whose only nonzero entry is $\theta_i$.
We then use a first-order Taylor approximation to estimate the effect of each parameter on the loss $\mathcal{L}$:

$$\Delta\mathcal{L}_i = \left|\mathcal{L}(\theta) - \mathcal{L}(\theta - \tilde{\theta}_i)\right| \approx \left|\theta_i \, \nabla_{\theta_i}\mathcal{L}(\theta)\right|$$

We call this approximated change of loss the sensitivity $s_i$. Then, rank all the teacher layers by their summed sensitivity (i.e. $S_\ell = \sum_{i \in \ell} s_i$), and select the top layers.
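The sensitivity scoring and layer ranking can be sketched as follows. This is a minimal NumPy sketch; the helper names are mine, and the per-parameter gradients are assumed to have been precomputed on some calibration data:

```python
import numpy as np

def sensitivity(weight: np.ndarray, grad: np.ndarray) -> np.ndarray:
    # First-order Taylor estimate of the loss change from zeroing
    # each parameter: s_i = |theta_i * dL/dtheta_i|.
    return np.abs(weight * grad)

def top_layers(weights: list, grads: list, k: int) -> list:
    # Rank layers by summed sensitivity and keep the top-k layer indices.
    scores = [sensitivity(w, g).sum() for w, g in zip(weights, grads)]
    return sorted(np.argsort(scores)[::-1][:k].tolist())
```

In practice the gradients would come from a backward pass over a small calibration set; here they are just plain arrays.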
Next, we do dimensionality reduction: from each selected layer, we choose the contiguous submatrix (matching the student's layer dimensions) with the highest cumulative sensitivity.
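The submatrix selection step might look like this (a sketch; `best_submatrix` is a hypothetical helper that scans a layer's sensitivity map with a 2D prefix sum, so each candidate window is scored in O(1)):

```python
import numpy as np

def best_submatrix(sens: np.ndarray, rows: int, cols: int) -> tuple:
    # Slide a (rows x cols) window over the sensitivity map and return the
    # top-left index of the window with the highest cumulative sensitivity.
    # P[i, j] holds the sum of sens[:i, :j] (2D prefix sum).
    P = np.zeros((sens.shape[0] + 1, sens.shape[1] + 1))
    P[1:, 1:] = sens.cumsum(axis=0).cumsum(axis=1)
    best, best_ij = -np.inf, (0, 0)
    for i in range(sens.shape[0] - rows + 1):
        for j in range(sens.shape[1] - cols + 1):
            total = (P[i + rows, j + cols] - P[i, j + cols]
                     - P[i + rows, j] + P[i, j])
            if total > best:
                best, best_ij = total, (i, j)
    return best_ij
```

The extracted teacher matrix is then `W[i:i+rows, j:j+cols]` for the returned `(i, j)`.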
Knowledge Injection
The extracted matrix $W$ is decomposed via SVD: $W = U \Sigma V^\top$.
Then, by the Eckart–Young–Mirsky theorem, truncating to the top-$r$ singular values gives the best rank-$r$ approximation $W_r = U_r \Sigma_r V_r^\top$ (in both Frobenius and spectral norm).
We can initialize the low-rank modules as $B = U_r \Sigma_r^{1/2}$ and $A = \Sigma_r^{1/2} V_r^\top$, so that $BA = W_r \approx W$.
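A minimal sketch of the SVD-based injection, assuming the conventional balanced split of the singular values between the two factors:

```python
import numpy as np

def lora_factors(W: np.ndarray, r: int) -> tuple:
    # Truncated SVD: by Eckart-Young-Mirsky, keeping the top-r singular
    # triplets gives the best rank-r approximation of W, with Frobenius
    # error sqrt(sum of the squared discarded singular values).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:r])
    B = U[:, :r] * root           # (m, r): left factor, scaled columns
    A = root[:, None] * Vt[:r]    # (r, n): right factor, scaled rows
    return B, A
```

Splitting $\Sigma_r^{1/2}$ across both factors keeps the two modules on a similar scale, which tends to behave better under gradient descent than putting all of $\Sigma_r$ on one side.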
Initialization w/ LoRA
Instead of initializing naively with $B = U_r \Sigma_r^{1/2}$ and $A = \Sigma_r^{1/2} V_r^\top$ (which would shift the student's output at step 0, since $W_0 + BA \neq W_0$), we do it a bit differently so as to ensure training starts from the original pre-trained weights.
In essence, this is a "warm start": training is informed by the teacher's knowledge, without disrupting the student's pretrained weights.
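One plausible way to realize such a warm start (an assumption on my part, not necessarily the paper's exact scheme) is to put teacher-derived directions into A and zero out B, so the product BA vanishes at initialization:

```python
import numpy as np

def warm_start_lora(W_teacher: np.ndarray, r: int) -> tuple:
    # Hypothetical warm start, not necessarily the paper's exact scheme:
    # A carries the teacher's top-r right singular directions, while
    # B = 0 makes the initial update BA = 0, so the student's forward
    # pass starts exactly at its pretrained weights W_0. Early gradient
    # steps are still shaped by the teacher-informed A.
    _, _, Vt = np.linalg.svd(W_teacher, full_matrices=False)
    A = Vt[:r]                              # teacher-informed directions
    B = np.zeros((W_teacher.shape[0], r))   # zero factor: BA = 0 at init
    return B, A
```

Any scheme with this shape works the same way: as long as BA = 0 at step 0, the student is unchanged at initialization while the adapter geometry encodes the extracted teacher knowledge.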
Thoughts
There is no "interpretability" of what each parameter is doing, though; the method is geared purely toward knowledge distillation and nothing else.