Probing the Pitfalls: Understanding SVD’s Shortcomings in Language Model Compression

Keywords: factorization-based compression, large language model optimization, linguistic representation probing, resource-efficient NLP

Abstract

Background: Modern computational linguistics heavily relies on large language models that demonstrate strong performance in various Natural Language Inference (NLI) tasks. These models, however, require substantial computational resources for both training and deployment. To address this challenge, a range of compression and acceleration techniques has been developed, including quantization, pruning, and factorization. Each of these approaches operates differently, can be applied at various levels of the model architecture, and is suited to different deployment scenarios.

Purpose: The objective of this study is to analyze and evaluate a factorization-based compression technique that reduces the computational footprint of large language models while preserving their accuracy in NLI tasks, particularly for resource-constrained or latency-sensitive applications.

Method: To evaluate the impact of factorization-based compression, we conducted probing experiments. First, we selected widely used pre-trained models (BERT-base and Llama 2) as baselines. We then applied low-rank factorization to their transformer layers using several singular value decomposition (SVD) algorithms at different compression rates. Next, we used probing tasks to analyze how the internal representations and linguistic knowledge of the compressed models changed. Finally, we related these representational changes to the models' NLI performance and to the compression rate achieved through factorization.
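
As a concrete illustration of the factorization step, the sketch below replaces a single transformer linear layer with a rank-r product of two smaller layers via truncated SVD. This is a minimal PyTorch example under our own assumptions (the layer sizes, the factorize_linear helper, and the chosen rank are illustrative), not the exact pipeline used in the study.

```python
# Minimal sketch: compress one linear layer with truncated SVD.
# The rank controls the compression rate; all names here are illustrative.
import torch
import torch.nn as nn


def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate W (out x in) by two smaller matrices B @ A of rank `rank`."""
    W = layer.weight.data                                  # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]

    # First projection: in_features -> rank (bias not needed here)
    A = nn.Linear(layer.in_features, rank, bias=False)
    A.weight.data = torch.diag(S_r) @ Vh_r                 # (rank, in_features)

    # Second projection: rank -> out_features (keeps the original bias)
    B = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    B.weight.data = U_r                                    # (out_features, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()

    return nn.Sequential(A, B)


# Example: compress a BERT-sized feed-forward projection to rank 128.
ffn = nn.Linear(768, 3072)
compressed = factorize_linear(ffn, rank=128)
x = torch.randn(4, 768)
print(torch.dist(ffn(x), compressed(x)))  # approximation error at this rank
```

At rank 128 the two factors hold roughly 0.5M parameters instead of the original 2.4M; sweeping the rank is what produces the different compression rates described above, and the probing classifiers are then trained on the hidden states of the resulting compressed model.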

Results: Naive uniform factorization often led to significant accuracy drops even at small compression rates, reflecting a noticeable degradation in the models' ability to recognize textual entailment. Probing tasks showed that these uniformly compressed models lost important syntactic and semantic information, which aligned with the observed performance decline. However, targeted compression approaches, such as selectively compressing the most redundant parts of the model or using weighted factorization algorithms, mitigated these negative effects.
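
One of the targeted strategies referred to above can be sketched as a weighted (activation-aware) SVD: the weight matrix is rescaled by per-input importance scores before factorization, so that directions carrying larger activations are approximated more faithfully. The helper below is a hypothetical PyTorch illustration; the scaling vector, function name, and rank are assumptions rather than the exact method evaluated in the study.

```python
# Hedged sketch of a weighted (activation-aware) truncated SVD.
# `act_scale` is assumed to hold positive per-input importance scores,
# e.g. average absolute activations collected on a small calibration set.
import torch
import torch.nn as nn


def weighted_factorize(layer: nn.Linear, rank: int,
                       act_scale: torch.Tensor) -> nn.Sequential:
    """Factorize W after rescaling its input dimensions by `act_scale`."""
    W = layer.weight.data                       # (out_features, in_features)
    S = torch.diag(act_scale)                   # emphasize important input channels
    U, Sigma, Vh = torch.linalg.svd(W @ S, full_matrices=False)

    # Fold the inverse scaling back into the right factor so that
    # B @ A still approximates the original W, not W @ S.
    A = nn.Linear(layer.in_features, rank, bias=False)
    A.weight.data = (torch.diag(Sigma[:rank]) @ Vh[:rank, :]
                     @ torch.diag(1.0 / act_scale))

    B = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    B.weight.data = U[:, :rank]
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)
```

In practice the importance scores would be estimated on calibration data, and per-layer ranks could then be chosen so that compression is concentrated in the most redundant layers rather than applied uniformly.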

Conclusion: These results demonstrate that factorization, when used properly, can significantly reduce computational requirements while preserving the core linguistic capabilities of large language models. Our research can inform the development of future compression techniques that adapt factorization strategies to the inherent structure of models and their tasks. These insights can help deploy LLMs in scenarios with limited computational resources.

Published
2024-12-30
How to Cite
Pletenev, S. (2024). Probing the Pitfalls: Understanding SVD’s Shortcomings in Language Model Compression. Journal of Language and Education, 10(4), 85-97. https://doi.org/10.17323/jle.2024.22368