NVIDIA Demonstrates Better LLM AI Through LoRA-Tuned Models Optimized in TensorRT-LLM
A minimal-training approach to customizing LLMs aims to lower the barrier to entry while delivering multiple model variants on a single GPU.
NVIDIA engineer Amit Bleiweiss has penned a guide to working with low-rank adaptation (LoRA) for better large language models (LLMs) using the company's open source TensorRT-LLM to accelerate performance on compatible graphics processors.
"Customizing LLMs is a challenging task, often requiring a full training process that is time-consuming and computationally expensive. Moreover, training LLMs requires a diverse and representative dataset, which can be difficult to obtain and curate," Bleiweiss explains. "One promising solution is Low-Rank Adaptation (LoRA), a fine-tuning method that can significantly reduce the number of trainable parameters, the memory requirement, and the training time, while achieving comparable or even better performance than fine-tuning on various NLP [Natural Language Processing] tasks and domains."
Designed to make it easier to fine-tune LLMs without repeating the whole training process, LoRA adds low-rank matrices into the LLM and trains only those, leaving the original weights as-is. There's another advantage to the approach, too: "By loading a single base model together with the low-rank matrices A and B for each respective LoRA tuned variant," Bleiweiss notes, "it's possible to store thousands of LLMs and run them dynamically and efficiently within a minimal GPU memory footprint."
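The core idea can be sketched in a few lines of NumPy. This is an illustrative toy, not NVIDIA's implementation: the layer sizes, rank, and function names below are assumptions chosen only to show how a frozen weight matrix W is combined with small trainable factors A and B.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 64, 64, 4  # hypothetical layer size and LoRA rank

# Frozen pretrained weight: never updated during LoRA fine-tuning.
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors. B starts at zero so the adapted layer
# initially behaves exactly like the base model.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def adapted_forward(x, scale=1.0):
    """y = W x + scale * B (A x): base output plus a low-rank update."""
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the LoRA branch contributes nothing, so outputs match.
assert np.allclose(adapted_forward(x), W @ x)

# Only A and B are trained: far fewer parameters than the full matrix.
full_params = W.size           # 64 * 64 = 4096
lora_params = A.size + B.size  # 4*64 + 64*4 = 512
print(full_params, lora_params)
```

Because only A and B differ between tuned variants, swapping variants means swapping these small factor pairs while the large W stays loaded once.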
To demonstrate the concept, Bleiweiss' guide walks through taking pre-tuned LLMs from the Hugging Face platform and optimizing them using NVIDIA's recently released open source TensorRT-LLM library, running both a single version of the model and two LoRA checkpoints to show that this doesn't double the memory requirements as you might expect. "With baseline support for many popular LLM architectures," he claims, "TensorRT-LLM makes it easy to deploy, experiment, and optimize with a variety of code LLMs."
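Some back-of-envelope arithmetic shows why loading a second LoRA checkpoint barely moves the memory needle. The figures below are hypothetical examples (a 7B-parameter base model in FP16, 32 adapted 4096x4096 matrices at rank 8), not measurements from the guide:

```python
# Illustrative memory arithmetic for serving several LoRA variants
# from one shared base model. All sizes are assumed, not measured.
base_params = 7_000_000_000  # e.g. a 7B-parameter base model
bytes_per_param = 2          # FP16

# Assume LoRA adapts 32 weight matrices of shape 4096x4096 at rank 8;
# each contributes an A (rank x 4096) and a B (4096 x rank) factor.
layers, d, rank = 32, 4096, 8
adapter_params = layers * 2 * d * rank

base_gb = base_params * bytes_per_param / 1e9
adapter_mb = adapter_params * bytes_per_param / 1e6

print(f"base model:  {base_gb:.1f} GB")   # loaded once
print(f"one adapter: {adapter_mb:.1f} MB")
# A second LoRA checkpoint adds another few MB of factors,
# not a second multi-gigabyte copy of the base model.
```

Under these assumptions each additional tuned variant costs megabytes against a base model measured in gigabytes, which is what makes serving many variants on one GPU practical.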
The full guide is now available on the NVIDIA developer blog; those interested in trying it out will require a graphics card compatible with TensorRT-LLM, which means one based on the Volta, Turing, Ampere, Hopper, or Ada Lovelace architectures.