Originally, we proposed to integrate sparsity into QLoRA [1] by extending the implementations available in Hugging Face’s PEFT or PyTorch’s torchtune libraries. The goal was to implement a form of fine-grained sparsity, Fixed Fan-In (FFI) sparsity, to prune one of the factors obtained from the low-rank decomposition of the pretrained weights. FFI sparsity is amenable to real-world acceleration, as previously demonstrated on NVIDIA GPUs and CPUs [2-3]. This implementation would let us use a higher-rank adapter within a fixed memory budget, or fine-tune larger models than would otherwise fit on a single Radeon PRO W7900.
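To make the idea concrete, the sketch below shows one way an FFI-sparse low-rank adapter can be expressed in PyTorch: every output row of the sparse factor keeps exactly `fan_in` nonzero weights, stored in a condensed layout together with the column indices they act on, and the adapter applies a dense down-projection followed by the FFI-sparse up-projection. The class names, shapes, and the choice of which factor to sparsify are illustrative assumptions, not the layout used by our HIP kernels.

```python
import torch
import torch.nn as nn


class FFISparseFactor(nn.Module):
    """Fixed Fan-In factor: every output row keeps exactly `fan_in` nonzero
    weights, stored as condensed values plus the column indices they act on.
    (Illustrative sketch only; not the layout used by our HIP kernels.)"""

    def __init__(self, in_features: int, out_features: int, fan_in: int):
        super().__init__()
        # Condensed nonzero values: one row of `fan_in` weights per output.
        self.values = nn.Parameter(torch.randn(out_features, fan_in) * 0.01)
        # Column indices of the kept connections, fixed per output row.
        indices = torch.stack(
            [torch.randperm(in_features)[:fan_in] for _ in range(out_features)]
        )
        self.register_buffer("indices", indices)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, in_features] -> gather only the connected inputs,
        # then contract each row's `fan_in` weights -> [batch, out_features].
        gathered = x[:, self.indices]                 # [batch, out, fan_in]
        return torch.einsum("bof,of->bo", gathered, self.values)


class FFISparseLoRA(nn.Module):
    """LoRA-style adapter whose up-projection factor is FFI-sparse, so a
    higher rank fits within the same parameter budget (sketch)."""

    def __init__(self, in_features, out_features, rank, fan_in, alpha=16.0):
        super().__init__()
        # fan_in must not exceed the adapter rank in this toy layout.
        self.A = nn.Linear(in_features, rank, bias=False)     # dense down-projection
        self.B = FFISparseFactor(rank, out_features, fan_in)  # FFI-sparse up-projection
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x)) * self.scale
```

In this layout a kernel only needs to stream `out_features × fan_in` values and indices with a regular access pattern, which is what makes FFI sparsity amenable to acceleration in practice [2-3].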
Unfortunately, we have so far been unable to tune our HIP FFI kernels to a level of performance suitable for integration into the QLoRA framework; see Figure 1. Currently, our best kernels take roughly 2x as long as the dense baseline at 90% sparsity. This is a disappointing result; however, we intend to continue tuning our FFI kernels with AMD’s profilers and will finish the implementation of the SQuASh adapter later in the year. On a positive note, the experience gained from our initial attempt has given us more confidence working with the ROCm stack, as we worked through and solved problems related to installing packages, building Docker images for ROCm projects, and developing HIP extensions for PyTorch. We hope to add our findings to existing resources online to improve the process for future developers.
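For readers who want to try the extension workflow themselves, the snippet below is a minimal sketch of how a custom kernel can be built as a PyTorch extension on ROCm: `torch.utils.cpp_extension` hipifies CUDA-style sources automatically when PyTorch is built for ROCm, so the same build script runs on the W7900 without a separate HIP code path. The toy `scale` kernel stands in for our FFI SpMM kernel, which is too long to reproduce here; all names are illustrative.

```python
import torch
from torch.utils.cpp_extension import load_inline

# CUDA-style source; on a ROCm build of PyTorch this is hipified and
# compiled with hipcc automatically.
cuda_src = r"""
#include <torch/extension.h>

__global__ void scale_kernel(const float* x, float* y, float s, int64_t n) {
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if (i < n) y[i] = s * x[i];
}

torch::Tensor scale(torch::Tensor x, double s) {
    TORCH_CHECK(x.is_cuda(), "expected a GPU tensor (HIP tensors report is_cuda() on ROCm)");
    auto xc = x.contiguous();
    auto y = torch::empty_like(xc);
    const int64_t n = xc.numel();
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(xc.data_ptr<float>(), y.data_ptr<float>(), (float)s, n);
    return y;
}
"""

# Declare the function here so load_inline can generate Python bindings for it.
cpp_src = "torch::Tensor scale(torch::Tensor x, double s);"

ext = load_inline(
    name="toy_rocm_ext",
    cpp_sources=cpp_src,
    cuda_sources=cuda_src,
    functions=["scale"],
    verbose=True,  # prints the hipify/compile commands, handy for debugging builds
)

y = ext.scale(torch.ones(8, device="cuda"), 2.0)  # "cuda" maps to the ROCm GPU
```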
We also spent some effort alpha-testing the bitsandbytes multi-backend refactor branch, which adds support for ROCm. We hope to summarize the build issues we encountered in an upcoming PR. Additionally, we developed a deep understanding of the QLoRA implementations in PEFT and torchtune. These efforts will be crucial for delivering the final project and helping others use state-of-the-art neural network algorithms on AMD hardware.
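For reference, the following is a minimal sketch of the PEFT + bitsandbytes QLoRA setup our adapter targets, using the 4-bit NF4 quantization path and a standard LoRA configuration. The model name matches our OPT-350M benchmark; the LoRA hyperparameters are illustrative, not the values used in our experiments.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",              # the model we benchmark against
    quantization_config=bnb_config,
    device_map="auto",
)

# Trainable low-rank adapters on the attention projections (illustrative values).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

On ROCm, loading the base model in 4 bits through this path is what exercises the multi-backend bitsandbytes build mentioned above.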
Instructions for building the Docker image and running benchmarks on OPT-350M are included in the GitHub repository.
[1] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” presented at the Thirty-seventh Conference on Neural Information Processing Systems, Nov. 2023. Available: https://openreview.net/forum?id=OUIFPHEgJU
[2] M. Lasby, A. Golubeva, U. Evci, M. Nica, and Y. Ioannou, “Dynamic Sparse Training with Structured Sparsity,” presented at the Twelfth International Conference on Learning Representations, Oct. 2023. Available: https://openreview.net/forum?id=kOBkxFRKTA
[3] E. Schultheis and R. Babbar, “Towards Memory-Efficient Training for Extremely Large Output Spaces -- Learning with 500k Labels on a Single Commodity GPU,” arXiv:2306.03725, Jun. 2023. doi: 10.48550/arXiv.2306.03725.