# Soft One Shot Stochastic Routing
This MD file discusses the possibility of making pruning-ready models.
## Assumptions:
**Iterative Pruning** Iterative pruning is computationally heavy because it relies on importance scores. After each pruning iteration the importance of some layers is likely to shift significantly, and that updated importance is unlikely to be propagated and averaged into the pruned model.
**One-Shot Block Pruning** One-shot pruning suffers because deleting multiple layers at arbitrary points in the network breaks the mapping function quite heavily. When a single layer is removed, the broken mapping function is likely to be restored by finetuning, but several breaks at different points make it challenging for finetuning to restore nearly the same mapping function as the pretrained model.
## Goals:
**Soft Pruning** Soft pruning should assist the mapping functions so that the difficulty of weight adaptation is reduced. Consider two blocks to be pruned: pruning the first one, finetuning, then pruning the second one and finetuning again should be easier than pruning both together. **Soft** means we prune them together yet ease the process.
**Soft One Shot** Soft one shot prunes all candidate blocks at once, but during training it simulates the case where only part of them are fake-pruned. This gives easier weight adaptation and makes each block more generalized.
## Approach:

**Dropout** Randomly dropping out the layers to be pruned, while keeping some of them, makes it easier to adapt the weights that are kept.
Such a fake-prune process reduces the difficulty of finetuning the kept weights because the mapping function has fewer breaking points.
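A minimal sketch of how such fake-pruning could be implemented, assuming a PyTorch model whose candidate blocks can be wrapped individually; `SkippableBlock` and its `active` flag are hypothetical names, and real Hugging Face decoder layers take extra arguments and return tuples, so the forward signature would need adjusting:

```python
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Hypothetical wrapper that lets a block be fake-pruned at runtime."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.active = True  # toggled by the soft-pruning schedule each step

    def forward(self, hidden_states, *args, **kwargs):
        if not self.active:
            # Identity skip: behaves as if the block were physically removed.
            return hidden_states
        return self.block(hidden_states, *args, **kwargs)
```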
But which approach is better? (A sketch of the three masking strategies follows this list.)
1. Assign each block to be pruned a probability p that increases from 0 to 1. This simulates the process in which all blocks are eventually pruned, but it may also create multiple breaking points in the mapping function.
2. Apply a one-hot hard mask that always masks out exactly one randomly chosen block. Over the iterations of training, every block gets a chance to be fake-pruned, so all blocks remain more generalized and are easier to finetune.
3. Drop each candidate block with a fixed 50% probability.
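A sketch of the three masking strategies above, assuming fake-pruning is driven by a per-step set of dropped block indices; `sample_drop_mask` is a hypothetical helper, and anchor blocks are simply excluded from `candidate_idx`:

```python
import random

def sample_drop_mask(candidate_idx, strategy, step=None, total_steps=None):
    """Return the set of candidate block indices to fake-prune this step.

    candidate_idx : indices of blocks that are pruning candidates
                    (anchor blocks are excluded from this list).
    strategy      : 'scheduled' | 'one_hot' | 'bernoulli', matching the
                    three options listed above.
    """
    if strategy == "scheduled":
        # Option 1: per-block drop probability ramping from 0 to 1.
        p = min(1.0, step / total_steps)
        return {i for i in candidate_idx if random.random() < p}
    if strategy == "one_hot":
        # Option 2: always fake-prune exactly one randomly chosen block.
        return {random.choice(candidate_idx)}
    if strategy == "bernoulli":
        # Option 3: drop each candidate block with 50% probability.
        return {i for i in candidate_idx if random.random() < 0.5}
    raise ValueError(f"unknown strategy: {strategy}")
```

At each training step the sampled set would toggle the wrappers, e.g. `wrapper.active = (idx not in dropped)`.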
**Property of the soft-pruned models**:
A model that has gone through such a process is `prune-ready`. Considering pruning half of the blocks, one can physically take the half-pruned model. Yet since the soft-pruning process generalizes the whole model, the original model still keeps its original performance, or even improves, thanks to the more generalized weights.
This means there is a range of inference-speed vs. quality trade-offs available when selecting the final model: one can add blocks back based on that, say after some per-layer importance checking, and then finetune the selected model to regain some performance.
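Physically taking the selected sub-model could look like the sketch below, assuming a Qwen/LLaMA-style Hugging Face layout where the decoder blocks live in `model.model.layers`; `extract_pruned_model` and `drop_idx` are hypothetical names:

```python
import torch.nn as nn

def extract_pruned_model(model, drop_idx):
    """Physically remove the blocks listed in drop_idx from a prune-ready model."""
    drop = set(drop_idx)
    kept = [blk for i, blk in enumerate(model.model.layers) if i not in drop]
    model.model.layers = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)  # keep the config consistent
    return model
```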
### One-shot experiments are performed on qwen05.
All layers are trainable during the process, but anchor layers are never subject to the 50% drop.
New run: training with batch size 4, 20 epochs, and a cosine scheduler for all soft-pruning processes.
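A hedged sketch of that configuration, assuming the runs use the Hugging Face `Trainer`; only batch size 4, 20 epochs, and the cosine scheduler come from the note above, the rest are placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="soft_prune_stage",      # placeholder path
    per_device_train_batch_size=4,      # bs 4
    num_train_epochs=20,                # 20 epochs
    lr_scheduler_type="cosine",         # cosine schedule
)
```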
Note: idx refers to block indices in the current model, not the original 24-block model; once blocks have been removed, the remaining blocks are re-indexed.
| layers | metrics: 'all' | ceval | path |
|:--------------------------------------------------------|:-------------------------------------------------------- |:-------------- |:-------------------------------------------------------------------- |
| 24(original) | 0.96830417 |0.56| |
| 24(after process, 8 anchors)[0, 1, 5, 9, 14, 17, 18, 23] | 0.96830417 |0.39| |
| 23 (idx:16) | 0.973x || |
| 22 (idx:16,15) | 0.97117296 || |
| 21 (idx:16,15,21) | 0.9681908 || |
| 20 (idx:16,15,21,3) | 0.96419085487 || /nas/people/xin/xian/llm/qwen05_april/1st_stage |
| 20 (2nd process, 6 anchors)[0,4,15,13,1,19] * | 0.97117296 || |
| 19 (2nd idx: 8) | 0.97117296 || |
| 18 (2nd idx: 8, 16) | 0.96918489 || |
| 17 (2nd idx: 8, 16, 17) | 0.97316103 || |
| 16 (2nd idx: 8, 16, 17, 6) | 0.97017892644 || /nas/people/xin/xian/llm/qwen05_april/8_layers_removed |
| 16 (3rd process, 6 anchors)[0, 1, 4, 11, 13, 15] * | 0.9662 || /nas/people/xin/xian/llm/qwen05_april/3rd_state_init/checkpoint-6247 |
| 15 (3rd idx: 7) | 0.964x || |
| 14 (3rd idx: 7,6) | 0.9612326043 || |
| 13 (3rd idx: 7,6, 10) | 0.958250497 || |
| 12 (3rd idx: 7,6, 10, 5) | 0.9532803 || |
| 11 (3rd idx: 7,6, 10, 5, 2) | 0.937375745 || |
## What's needed
The rows marked with * in the table above show that finetuning the pruned model again at the same layer count can still bring improvements, which indicates that the general knowledge gained during training maps properly to inference.
If we add a standalone importance-checking step, say every 1k steps, that uses a few batches to simulate inference (with training data), computes the loss, and then reassigns the anchor layers, we can avoid the iterative solution and the manual post-checking steps.
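A rough sketch of such an importance-checking step, assuming the fake-pruning wrappers sketched earlier and a model that returns `.loss` when called with labels; `reassign_anchors`, `wrappers`, and `eval_batches` are hypothetical names:

```python
import torch

@torch.no_grad()
def reassign_anchors(model, wrappers, eval_batches, num_anchors):
    """Score each candidate block by the loss increase when it alone is
    fake-pruned, then pick the most important blocks as the new anchors."""

    def mean_loss():
        losses = [model(**batch).loss.item() for batch in eval_batches]
        return sum(losses) / len(losses)

    # Baseline loss with every block active.
    for w in wrappers:
        w.active = True
    base = mean_loss()

    importance = {}
    for idx, w in enumerate(wrappers):
        w.active = False            # fake-prune only this block
        importance[idx] = mean_loss() - base
        w.active = True

    # Blocks whose removal hurts the most become the new anchors.
    return sorted(importance, key=importance.get, reverse=True)[:num_anchors]
```

Run every 1k steps or so, this could replace the manual post-checking by reassigning anchors automatically during training.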
## Summary 12-04-2024
1. We lack close communication with HQ at a higher level; mainly, we do not get frequent updates from them.
2. We should check the minimum modules in a localized way: what is the minimum task network that is needed?
3. We should check whether we can build the final model based on 0.5B or 1.8B. The issue is the difference in width between those models. We know that with 0.5B we can build a minimum module that works well for the task. With a good manner of localization, or by merging weights, we might end up with less (or more) depth than 1.8B but less width, resulting in a smaller model.
## Experiments to control ceval
| layers | metrics: 'all' | ceval | path |
|:--------------------------------------------------------|:-------------------------------------------------------- |:-------------- |:-------------------------------------------------------------------- |
| 24(original) | 0.966202783 |0.56| |
| 21 | 0.9652087 |0.56166419|/nas/people/xin/xian/llm/qwen18b_april/2nd|
| 20 | 0.963x |0.56166419||