QLoRA (Quantized LoRA)
QLoRA is an extension of LoRA designed to reduce the memory requirements of fine-tuning large language models by quantizing the pre-trained model's weight parameters to 4-bit precision. In typical LLMs the parameters are stored in 16- or 32-bit floating-point formats; QLoRA compresses them to 4 bits, making the model's memory footprint much smaller. This compression enables the fine-tuning of LLMs on hardware with limited memory, such as consumer-grade GPUs, which previously could not hold such models.
Quantizing the Model Weights to 4-bit Precision
The first step in QLoRA is compressing the weight parameters of the pre-trained large language model from their usual 16- or 32-bit precision to a more memory-efficient 4-bit format, which sharply reduces the memory the base model occupies.
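To make the mechanic concrete, here is a minimal NumPy sketch of symmetric, block-wise 4-bit integer quantization. This is not the kernel QLoRA actually uses (QLoRA relies on the Normal Float type described next), but it shows the basic idea: each block of weights is scaled by its absolute maximum, rounded to a small set of integer levels, and only the small scaling constant is kept in higher precision.

```python
import numpy as np

def quantize_block_int4(weights: np.ndarray):
    """Symmetric absmax quantization of one block of weights to 4-bit integers.

    Returns the quantized integers (range -7..7) and the scale needed to
    dequantize them; the scale itself stays in higher precision.
    """
    scale = np.abs(weights).max() / 7.0          # map the largest weight to +/-7
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_block_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate full-precision weights from the 4-bit integers."""
    return q.astype(np.float32) * scale

# Toy block of 64 weights, roughly normally distributed like real model weights.
rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=64).astype(np.float32)

q, scale = quantize_block_int4(block)
approx = dequantize_block_int4(q, scale)
print("max absolute error:", np.abs(block - approx).max())
```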
Using 4-bit Normal Float for Improved Quantization
Instead of standard 4-bit integers or floating-point representations, QLoRA introduces a novel "4-bit Normal Float" (NF4) data type optimized for normally distributed data, which is a good match for pre-trained model weights, since they typically follow a roughly zero-centered Gaussian distribution. NF4 is a form of quantile quantization: its 16 levels are derived from the quantiles of a standard normal distribution, so each quantization bin is expected to hold an equal number of weight values. This places the limited 4-bit levels where the weights actually concentrate, rather than wasting them on rarely occurring extreme values, giving a more accurate representation for the same memory budget.
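The sketch below, assuming NumPy and SciPy, illustrates the idea behind quantile-based 4-bit levels: the 16 levels are taken from the quantiles of a standard normal distribution and rescaled to [-1, 1], so normally distributed weights fall into the bins roughly evenly. It is a simplification of the paper's actual NF4 code book (real NF4, for example, is constructed so that zero is represented exactly), but it shows why normal-quantile levels suit LLM weights better than evenly spaced ones.

```python
import numpy as np
from scipy.stats import norm

def normal_float_levels(num_levels: int = 16) -> np.ndarray:
    """Build quantization levels from the quantiles of a standard normal
    distribution, rescaled into [-1, 1], so each level covers roughly an
    equal share of normally distributed weights."""
    # Evenly spaced probabilities, skipping the infinite 0 and 1 quantiles.
    probs = np.linspace(0.0, 1.0, num_levels + 2)[1:-1]
    levels = norm.ppf(probs)
    return levels / np.abs(levels).max()

def quantize_to_levels(weights: np.ndarray, levels: np.ndarray):
    """Absmax-normalize a block of weights and snap each one to the nearest level."""
    scale = np.abs(weights).max()
    normalized = weights / scale
    codes = np.abs(normalized[:, None] - levels[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

levels = normal_float_levels()
rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=64).astype(np.float32)

codes, scale = quantize_to_levels(block, levels)
approx = levels[codes] * scale
print("max absolute error:", np.abs(block - approx).max())
```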
Applying Double Quantization
To compress the model further, QLoRA quantizes in two steps. The first level quantizes the weight parameters themselves, storing one scaling constant per block of weights. The second level then quantizes those scaling constants as well, storing them in 8 bits rather than 32, which the QLoRA paper reports saves roughly 0.37 bits per parameter on average.
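A quick back-of-the-envelope calculation shows what this buys. Using the block sizes reported in the QLoRA paper (64 weights per first-level scaling constant, 256 constants per second-level constant), the overhead contributed by the constants drops from 0.5 to roughly 0.127 bits per parameter:

```python
# Memory overhead of the quantization constants, per weight, in bits.
# Block sizes follow the QLoRA paper: 64 weights share one scaling constant,
# and 256 scaling constants share one second-level constant.
FIRST_BLOCK = 64
SECOND_BLOCK = 256

# Without double quantization: one 32-bit float constant per 64 weights.
single = 32 / FIRST_BLOCK

# With double quantization: 8-bit constants, plus one 32-bit second-level
# constant per 256 first-level constants.
double = 8 / FIRST_BLOCK + 32 / (FIRST_BLOCK * SECOND_BLOCK)

print(f"without double quantization: {single:.3f} bits per parameter")
print(f"with double quantization:    {double:.3f} bits per parameter")
print(f"saving:                      {single - double:.3f} bits per parameter")
```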
Paged Optimizers for Efficient Memory Management
When training large models, especially on GPUs with limited memory, it's important to manage memory efficiently and avoid out-of-memory errors when usage spikes. QLoRA handles this with a technique called "paged optimizers", which relies on NVIDIA's unified memory feature to migrate memory pages automatically between CPU and GPU. When the GPU runs low on memory, parts of the optimizer state are paged out to CPU memory, and those pages are transferred back to the GPU when they are needed for the next optimizer update.
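In practice a paged optimizer is simply requested by name. As a sketch, assuming the bitsandbytes library's PagedAdamW8bit class (the same optimizer the Trainer setup below refers to), it can also be constructed directly:

```python
import torch
import bitsandbytes as bnb

# Toy module standing in for the trainable adapter parameters.
model = torch.nn.Linear(512, 512)

# The paged optimizer keeps its state in unified memory, so pages can be
# evicted to CPU RAM when the GPU is full and migrated back for each update.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)
```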
Implementing QLoRA is very similar to LoRA, but with a few minor changes. Here's how it works:
We begin by loading a pre-trained causal language model and applying 4-bit quantization to reduce the model’s memory footprint.
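A sketch of this step using the Hugging Face transformers and bitsandbytes integration; the model name is just a placeholder, and any causal LM from the Hub works:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # example model; substitute your own

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit Normal Float
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```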
Just like LoRA, we configure the QLoRA parameters through LoraConfig. The main difference here is the target_modules argument, which specifies the attention projection layers (q_proj, k_proj, v_proj) that the adapters are attached to. These layers are central to the attention mechanism, and LoRA learns low-rank updates for them while the quantized base weights stay frozen.
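A possible configuration, with illustrative hyperparameters (the rank, alpha, and dropout values below are examples, not prescribed by QLoRA itself):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                           # rank of the low-rank update
    lora_alpha=32,                                  # scaling factor for the update
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention projections to adapt
)
```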
We then use the get_peft_model function to apply the QLoRA configuration to the model.
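Continuing from the snippets above:

```python
from peft import get_peft_model

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```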
The model is then set up for training using the Trainer class that we used previously. The notable change is the optimizer, which is set to "paged_adamw_8bit", a paged 8-bit AdamW variant designed for memory-efficient training of quantized models.
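A minimal training setup along these lines, assuming a tokenized train_dataset has been prepared elsewhere:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    optim="paged_adamw_8bit",   # paged 8-bit AdamW for the adapter weights
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed to be tokenized already
)
trainer.train()
```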