TL;DR: knowledge transfer from a larger model to a smaller one, using LoRA, without any participation from the teacher during training.
Knowledge Extraction
This is the only part where the teacher is used. Let the teacher's parameters be $\theta$. We can isolate a single parameter $\theta_i$ by zeroing every other parameter, giving $\tilde{\theta}_i$ whose only nonzero entry is $\theta_i$.
We then use a first-order Taylor approximation to estimate the effect of each parameter on the loss $\mathcal{L}$:

$$\Delta\mathcal{L}_i = \left|\mathcal{L}(\theta) - \mathcal{L}(\theta - \tilde{\theta}_i)\right| \approx \left|\theta_i \, \nabla_{\theta_i}\mathcal{L}(\theta)\right|$$

We call this approximated change of loss the sensitivity $s_i$. Then, rank all the teacher layers by their summed sensitivity (i.e. $S_\ell = \sum_{i \in \ell} s_i$), and select the top layers.
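The sensitivity scoring and layer ranking can be sketched as follows. This is a minimal NumPy sketch; the helper names are mine, and the per-parameter gradients are assumed to have been precomputed on some calibration data:

```python
import numpy as np

def sensitivity(weight: np.ndarray, grad: np.ndarray) -> np.ndarray:
    # First-order Taylor estimate of the loss change from zeroing
    # each parameter: s_i = |theta_i * dL/dtheta_i|.
    return np.abs(weight * grad)

def top_layers(weights: list, grads: list, k: int) -> list:
    # Rank layers by summed sensitivity and keep the top-k layer indices.
    scores = [sensitivity(w, g).sum() for w, g in zip(weights, grads)]
    return sorted(np.argsort(scores)[::-1][:k].tolist())
```

In practice the gradients would come from a backward pass over a small calibration set; here they are just plain arrays.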
Next, we do dimensionality reduction: from each selected layer, we choose the contiguous submatrix (matching the student's layer dimensions) with the highest cumulative sensitivity.
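The submatrix selection step might look like this (a sketch; `best_submatrix` is a hypothetical helper that scans a layer's sensitivity map with a 2D prefix sum, so each candidate window is scored in O(1)):

```python
import numpy as np

def best_submatrix(sens: np.ndarray, rows: int, cols: int) -> tuple:
    # Slide a (rows x cols) window over the sensitivity map and return the
    # top-left index of the window with the highest cumulative sensitivity.
    # P[i, j] holds the sum of sens[:i, :j] (2D prefix sum).
    P = np.zeros((sens.shape[0] + 1, sens.shape[1] + 1))
    P[1:, 1:] = sens.cumsum(axis=0).cumsum(axis=1)
    best, best_ij = -np.inf, (0, 0)
    for i in range(sens.shape[0] - rows + 1):
        for j in range(sens.shape[1] - cols + 1):
            total = (P[i + rows, j + cols] - P[i, j + cols]
                     - P[i + rows, j] + P[i, j])
            if total > best:
                best, best_ij = total, (i, j)
    return best_ij
```

The extracted teacher matrix is then `W[i:i+rows, j:j+cols]` for the returned `(i, j)`.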
Knowledge Injection
The extracted matrix $W$ is decomposed via SVD: $W = U \Sigma V^\top$.
Then, by the Eckart–Young–Mirsky theorem, truncating to the top-$r$ singular values gives the best rank-$r$ approximation $W_r = U_r \Sigma_r V_r^\top$ (in both Frobenius and spectral norm).
We can initialize the low-rank modules as $B = U_r \Sigma_r^{1/2}$ and $A = \Sigma_r^{1/2} V_r^\top$, so that $BA = W_r \approx W$.
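A minimal sketch of the SVD-based injection, assuming the conventional balanced split of the singular values between the two factors:

```python
import numpy as np

def lora_factors(W: np.ndarray, r: int) -> tuple:
    # Truncated SVD: by Eckart-Young-Mirsky, keeping the top-r singular
    # triplets gives the best rank-r approximation of W, with Frobenius
    # error sqrt(sum of the squared discarded singular values).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:r])
    B = U[:, :r] * root           # (m, r): left factor, scaled columns
    A = root[:, None] * Vt[:r]    # (r, n): right factor, scaled rows
    return B, A
```

Splitting $\Sigma_r^{1/2}$ across both factors keeps the two modules on a similar scale, which tends to behave better under gradient descent than putting all of $\Sigma_r$ on one side.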
Initialization w/ LoRA
Instead of initializing naively with $B = U_r \Sigma_r^{1/2}$ and $A = \Sigma_r^{1/2} V_r^\top$ (which would shift the student's output at step 0, since $W_0 + BA \neq W_0$), we do it a bit differently so as to ensure training starts from the original pre-trained weights.
In essence, this is a "warm start": training is informed by the teacher's knowledge, without disrupting the student's pretrained weights.
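One plausible way to realize such a warm start (an assumption on my part, not necessarily the paper's exact scheme) is to put teacher-derived directions into A and zero out B, so the product BA vanishes at initialization:

```python
import numpy as np

def warm_start_lora(W_teacher: np.ndarray, r: int) -> tuple:
    # Hypothetical warm start, not necessarily the paper's exact scheme:
    # A carries the teacher's top-r right singular directions, while
    # B = 0 makes the initial update BA = 0, so the student's forward
    # pass starts exactly at its pretrained weights W_0. Early gradient
    # steps are still shaped by the teacher-informed A.
    _, _, Vt = np.linalg.svd(W_teacher, full_matrices=False)
    A = Vt[:r]                              # teacher-informed directions
    B = np.zeros((W_teacher.shape[0], r))   # zero factor: BA = 0 at init
    return B, A
```

Any scheme with this shape works the same way: as long as BA = 0 at step 0, the student is unchanged at initialization while the adapter geometry encodes the extracted teacher knowledge.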
Thoughts
There is no "interpretability" of what each parameter is doing, though; the method is geared purely toward knowledge distillation and nothing else.