Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
Length:
29 minutes
Released:
Nov 29, 2023
Format:
Podcast episode
Description
In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We initially observe that, with a novel operation called DARE (Drop And REscale), most delta parameters can be directly set to zero without affecting the capabilities of SFT LMs, and that larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify the delta parameters of multiple homologous SFT models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge WizardLM, WizardMath, and Code Alpaca, all based on Llama 2. Experimental results show that: (1) The delta parameter value ranges of SFT models are typically small, often within 0.005, and DARE can eliminate 99% of them effortlessly. However, once the models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We also try removing fine-tuned parameters instead of delta parameters and find that even a 10% reduction can drastically decrease performance (even to 0). This highlights that SFT merely stimulates abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, merging WizardLM with WizardMath improves WizardLM's GSM8K zero-shot accuracy from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original score of 64.2. Code is available at https://github.com/yule-BUAA/MergeLM.
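The following is a minimal sketch of the DARE operation and the parameter-averaging merge as described in the abstract: drop a fraction of each model's delta parameters at random, rescale the survivors by 1 / (1 - drop rate), and average the sparsified deltas on top of the shared pre-trained weights. It assumes plain PyTorch state dicts; the function names `dare` and `merge_with_dare` are illustrative and not taken from the MergeLM repository.

```python
import torch


def dare(delta: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """Randomly zero out a fraction `drop_rate` of delta parameters,
    then rescale the survivors by 1 / (1 - drop_rate)."""
    keep_mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
    return delta * keep_mask / (1.0 - drop_rate)


def merge_with_dare(pretrained: dict, finetuned_models: list, drop_rate: float = 0.9) -> dict:
    """Sparsify each SFT model's delta parameters with DARE, then merge the
    models by averaging the sparsified deltas over the pre-trained weights."""
    merged = {}
    for name, base in pretrained.items():
        deltas = [dare(ft[name] - base, drop_rate) for ft in finetuned_models]
        merged[name] = base + torch.stack(deltas).mean(dim=0)
    return merged
```

In this sketch, a drop rate of 0.9 zeroes 90% of the delta parameters; the abstract reports that rates up to 99% leave SFT abilities intact when the deltas are small.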
2023: Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li
https://arxiv.org/pdf/2311.03099v1.pdf