TL;DR: knowledge transfer from a larger model to a smaller one, using LoRA. The teacher is queried once for extraction but does not participate in training.
Knowledge Extraction
This is the only part where the teacher is used. Let's say the teacher's weight matrix is $W_T$, with entries $w_{ij}$.

We then use a first-order Taylor approximation to estimate the effect of zeroing out each parameter in $W_T$ on the loss $\mathcal{L}$:

$$\Delta \mathcal{L}_{ij} = \mathcal{L}(w_{ij} = 0) - \mathcal{L}(w_{ij}) \approx -\frac{\partial \mathcal{L}}{\partial w_{ij}}\, w_{ij}$$

We call the magnitude of this approximated change of loss the sensitivity:

$$S_{ij} = \left|\frac{\partial \mathcal{L}}{\partial w_{ij}}\, w_{ij}\right|$$
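The sensitivity score above can be sketched in a few lines, assuming the gradient of the loss w.r.t. the teacher's weight matrix is available (here both are filled with hypothetical random values in place of a real backward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
W_T = rng.normal(size=(8, 8))   # teacher weight matrix (hypothetical)
grad = rng.normal(size=(8, 8))  # dL/dW, e.g. from one backward pass

# First-order Taylor sensitivity: S_ij = |dL/dw_ij * w_ij|
S = np.abs(grad * W_T)
print(S.shape)  # (8, 8)
```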
Next, we do dimensionality reduction, which is just choosing, per layer, the contiguous submatrix (sized to match the student's layer dimensions) with the highest cumulative sensitivity.
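A minimal sketch of the selection step, assuming a sensitivity matrix `S` and a target student shape; `best_submatrix` is a hypothetical helper that scans all contiguous windows using 2-D prefix sums:

```python
import numpy as np

def best_submatrix(S: np.ndarray, h: int, w: int) -> tuple:
    """Return the top-left (i, j) of the h x w contiguous submatrix
    of S with the highest cumulative sensitivity."""
    # 2-D prefix sums: P[i, j] = sum of S[:i, :j]
    P = np.zeros((S.shape[0] + 1, S.shape[1] + 1))
    P[1:, 1:] = S.cumsum(0).cumsum(1)
    best, best_ij = -np.inf, (0, 0)
    for i in range(S.shape[0] - h + 1):
        for j in range(S.shape[1] - w + 1):
            total = P[i + h, j + w] - P[i, j + w] - P[i + h, j] + P[i, j]
            if total > best:
                best, best_ij = total, (i, j)
    return best_ij

rng = np.random.default_rng(0)
S = rng.random((16, 16))
S[3:7, 5:11] += 10.0            # plant a high-sensitivity block
i, j = best_submatrix(S, 4, 6)  # student layer is 4 x 6 here
print(i, j)  # → 3 5
```

Brute-force scanning is fine at this scale; the prefix sums just make each window sum O(1).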
Knowledge Injection
The extracted matrix $W_E$ is decomposed via SVD: $W_E = U \Sigma V^\top$.

Then, by the Eckart-Young-Mirsky theorem, the truncated SVD $U_r \Sigma_r V_r^\top$ is the best rank-$r$ approximation of $W_E$ (in Frobenius norm).

We can initialize the low-rank module from these factors, splitting the singular values symmetrically:

$$B = U_r \Sigma_r^{1/2}, \qquad A = \Sigma_r^{1/2} V_r^\top, \qquad BA \approx W_E$$
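A sketch of the injection step, assuming an extracted matrix `W_E` and a chosen LoRA rank `r` (names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
W_E = rng.normal(size=(16, 16))  # extracted submatrix (hypothetical)
r = 4                            # LoRA rank

U, s, Vt = np.linalg.svd(W_E, full_matrices=False)
# Split singular values symmetrically between the two factors.
B = U[:, :r] * np.sqrt(s[:r])         # shape (16, r)
A = np.sqrt(s[:r])[:, None] * Vt[:r]  # shape (r, 16)

# B @ A equals the best rank-r approximation (Eckart-Young-Mirsky).
W_r = (U[:, :r] * s[:r]) @ Vt[:r]
print(np.allclose(B @ A, W_r))  # → True
```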
Initialization w/ LoRA
Instead of initializing the adapters naively, with $A \sim \mathcal{N}(0, \sigma^2)$ and $B = 0$ (the standard LoRA initialization), we initialize them with the SVD factors of the extracted teacher matrix.
In essence, this is a "warm start": training is informed by the teacher's knowledge, and we get this without disrupting the student's pretrained weights.
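The warm start can be sketched as a LoRA-style layer: the student's pretrained weight stays frozen, and only the low-rank factors (here warm-started from hypothetical SVD factors rather than the usual zero/Gaussian init) are trained:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, W: np.ndarray, B: np.ndarray, A: np.ndarray):
        self.W = W              # student's pretrained weight, kept frozen
        self.B, self.A = B, A   # trainable low-rank factors (warm-started)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Output = base path + low-rank path: x W^T + x (BA)^T
        return x @ self.W.T + x @ (self.B @ self.A).T

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
B = rng.normal(size=(8, 2))
A = rng.normal(size=(2, 8))
layer = LoRALinear(W, B, A)
x = rng.normal(size=(1, 8))
print(layer(x).shape)  # → (1, 8)
```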
Thoughts
Well, there is no "interpretability" of what each extracted parameter is doing. And the method is narrowly geared toward knowledge distillation and nothing else.